Re: Solidus variations

2011-10-07 Thread Asmus Freytag
Murray's work comes from the desire to represent mathematical equations 
faithfully, based almost entirely on the semantics of the operators, 
with those operators represented as Unicode characters.


One device he uses is redundant parens. Parens can 
be supplied to group operands, so that you get the correct precedence, 
but where they are not necessary to the human reader, they will be 
dropped in the formatted equation.


As an input format, the linear format therefore looks more like current 
source code, in that one *does* type parens.


When fractions are built up, you don't need the parens, so they are 
dropped in layout. If you take the same fraction and display it inline 
(with a slash), some or all of the parens would be needed by the human 
reader as well, so those are displayed.


How would you input 5.5 as a fraction? As 51/2? You do need some 
form of grouping to recognize that the 5 and the 1 are not part of the 
same numerator.
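
Concretely (the linear notation here is illustrative, not necessarily 
Murray's exact syntax):

    \[ \texttt{51/2} \rightarrow \tfrac{51}{2}
       \qquad\text{vs.}\qquad
       \texttt{5(1/2)} \rightarrow 5\tfrac{1}{2} \]

In the built-up form the parens are dropped; displayed inline, they must 
reappear so the reader doesn't misread 5.5 as fifty-one halves.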


A./



Re: Continue: Glaring mistake in nomenclature, should it have been Assamese?

2011-09-14 Thread Asmus Freytag

On 9/14/2011 11:14 AM, Michael Everson wrote:

At this point, I think I have to make a plea: Sarasvati, spare us.

+1


Re: Need for Level Direction Mark

2011-09-13 Thread Asmus Freytag

On 9/13/2011 6:01 AM, Philippe Verdy wrote:

Unfortunately, adding controls would imply the creation of new Bidi
classes for them (and forgetting the stability policy about them,
which was published too soon before solving evident problems).


The first part is correct, and giving up stability to that degree would 
be a serious issue.


I disagree with the second part. True plaintext bidi will always be a 
compromise, because there's a lack of information on the intent of the 
writer. (In rich text, you can supply that with styles). There's a 
limited workaround with bidi controls, but that's beginning to be a form 
of minimal rich text in itself.


Stability is paramount for predictability. You need to be able to 
predict what your reader will see, and you will only be able to do that 
when you can rely on all implementations agreeing on the details of how 
to lay out bidi.


Introducing any new feature now will result in decades of 
implementations having different levels of support for it. This makes 
the use of such a new feature unpredictable - and that is a problem 
whether there was a formal stability guarantee or not.


A./


Re: ligature usage - WAS: How do we find out what assigned code points aren't normally used in text?

2011-09-11 Thread Asmus Freytag

On 9/9/2011 8:12 PM, Stephan Stiller wrote:

Dear Martin,

Thanks for alerting me to the issue of causal direction of aesthetic 
preference - it's been on my mind, but your reply helps me sort out 
some details.


When I first encountered text (outside of the German language locale) 
with ample use of ligatures in modern printed text, I definitely found 
the ligatures a bit distracting, but partly just because I wasn't used 
to them. I also perceived them as a solution to what (in Germany) 
appeared to me to be a real non-issue.


Put simply, there is a conflict between full flexibility for font 
designs and the burden imposed by sophisticated ligatures and kerning 
tables.


From my background I never perceived a need, but I guess I (and most 
people??) wouldn't really mind the tradition coming back (in Germany) 
if things are designed well (which is the job of the font designer) 
and for the user everything is handled automatically in the background 
by the available technology ...


Which cannot happen for German, as it is one of the languages where the 
same letter pair may or may not have a ligature based on the *meaning* 
of the word - something that you can't automate.


We have had famous discussions on this subject on this list. Take the st 
ligature: there are two meanings for the German word Wachstube, and only 
one allows the st ligature. A human would have to decide when the 
ligature is appropriate. (Incidentally, the same goes for hyphenation 
of this word: one meaning allows a hyphen after the s, the other does 
not.)


Certain layout processes, in certain cases, in certain languages, simply 
can't be fully automated.
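
What plain text can do is record the human's decision - for instance 
with ZERO WIDTH NON-JOINER, the standard's ligature-inhibiting control 
(whether a renderer honors it is up to the font and layout engine). A 
sketch:

    ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER: requests that no ligature form

    # Wach+Stube (guard room): st lies inside one morpheme; ligature is fine.
    guard_room = "Wachstube"

    # Wachs+Tube (wax tube): s and t straddle the morpheme boundary, so a
    # human inserts ZWNJ to suppress the st ligature at the seam.
    wax_tube = "Wachs" + ZWNJ + "tube"

    assert guard_room != wax_tube  # the distinction survives in plain text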


A./


Stephan








Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-09-01 Thread Asmus Freytag

On 8/31/2011 11:25 PM, Philippe Verdy wrote:

2011/9/1 Karl Williamson <pub...@khwilliamson.com>:
But now that I'm a UTC member, I hope I will hear these cases earlier... 


Congratulations!


Does it justify so many new aliases at the same time ?


No. I'm firmly with you. I support the requirement for 1 (ONE) alias for 
control codes, because they don't have names but are used in 
environments where they need a string identifier other than a code point. 
(Just like regular characters, but even more so.)


I also support the requirement for 1 (ONE) short identifier, for all 
those control AND format characters for which widespread usage of such 
an abbreviation is customary. (VS-257 does not qualify).
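
(A sketch of that use - alias-aware lookup as, for example, today's 
Python exposes it; the alias strings are the ones from NameAliases.txt:)

    import unicodedata

    # Control codes have no Unicode character name; the formal alias
    # supplies the missing string identifier:
    assert unicodedata.lookup("ESCAPE") == "\x1b"
    assert "\N{ESCAPE}" == "\x1b"

    # Customary abbreviations resolve the same way:
    assert "\N{NBSP}" == "\u00a0"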


Further, I support, on a case-by-case basis, the addition of duplicate 
aliases for reasons of compatibility. I would expect these 
compatibility requirements to be documented for each character in some 
sort of proposal document, not just as a list of entries in a draft 
property file.


Finally, I don't support using the name of any standard, ISO or 
otherwise, as a label in the new status field. It sets the wrong precedent.


I've not checked the history of all past versions of UAX, UTR, and UTN 
(or even in the text of chapters of the main UTS)... Are there other 
cases in those past versions, that this PRI should investigate and 
track back ? 


My preference would be to start this new scheme off with a minimum of 
absolutely 100% required aliases. Anything even remotely doubtful should 
be removed for further study.


A./




Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-28 Thread Asmus Freytag

On 8/28/2011 9:46 PM, Doug Ewell wrote:

Philippe Verdy wrote:


If there are other mappings to do with other standards, and those
standards must be only informative, we already have the /MAPPINGS
directory beside the /UNIDATA directory where the UCD belongs too.


But in general, with the exception of MAPPINGS/VENDORS/MISC/SGML.TXT, 
the MAPPINGS directory isn't really a place for character *name* 
mappings.  It's primarily a place for *code point* mappings, for 
identifying U+0430 CYRILLIC SMALL LETTER A with 0xD0 in ISO 8859-5, 
and 0xE0 in Windows-1251, and 0xE0 in MacCyrillic, and 0xC1 in KOI8-R. 
Character names in other standards, like 'acy' for U+0430, are 
comparatively less important.


Right. However, NAME mapping has not been a major issue - except for 
control codes, since Unicode did not name these, even though they were 
routinely named by people dealing with them.


It's really important not to jump off the deep end and appear to create 
a precedent for name MAPPING across standards, when what is desired is to 
have IDENTIFIERS for certain characters, as well as SHORT IDENTIFIERS for 
characters very commonly referred to by identifier in source code 
(regular expressions, etc.).
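
A sketch of the identifier use case, using the Perl-style \N{...} 
escape (shown here in Python, where the string literal resolves names 
and formal aliases):

    import re

    # NEL is the abbreviation alias of U+0085, which has no character name;
    # the alias is what lets source code refer to it by identifier.
    pattern = re.compile("\N{NEL}")
    assert pattern.search("first line\u0085second line")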


A./




Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-28 Thread Asmus Freytag

On 8/28/2011 6:43 PM, Philippe Verdy wrote:

2011/8/27 Asmus Freytag <asm...@ix.netcom.com>:

I also think that the status field iso6429 is badly named. It should be
control, and what is named control should be control-alternate, or
perhaps, both of these groups should become simply control. I think the
labels chosen by the data file just set up bad precedents. If 6429, why not
a section for 9535 (or whatever the kbd standard is) etc.

Thanks a lot for admitting what I was trying to demonstrate to you in a 
prior message (which was earlier dismissed as a complete non-starter).


You appeared to be making a non-starter proposal, rather than clearly 
making a hypothetical proposal designed only to showcase certain logical 
flaws in the PRI. If the latter was your intention, well, we 
misunderstood you; but everybody seems to be on the same page now, which 
is good.


I also think that there are too many aliases for controls, if the only
need is for Perl to have a name to uniquely designate those controls.
Choose one alias name, but there's absolutely no emergency for now for
adding four aliases at once for them, when there's no demonstration
that all those aliases are needed! This is just unnecessary pollution
of the UCS namespace.


I tend to agree - however, I do think giving the common abbreviations 
some formal status is useful.


If I remember correctly, even in Perl some of the names are legacy 
names. If programs other than Perl have an active need to support 
legacy names, then I would favor adding these one by one as demonstrated 
needs arise, but NOT wholesale, just because they existed in 6429 in 
some version.



Now, here's a subtle point: adding certain alias strings to the file is 
a cheap way for the editing tools that verify the uniqueness of the 
namespace to reserve a name (so it can't ever be given to a different 
character). Kind of like what happened to BELL. I bet a big motivation 
behind the long list (all for control codes) was to prevent any 
non-control code from ever getting a name that happens to match a known 
control code name.


While I appreciate that sentiment, I think this part of the proposal 
should not be rushed - aliases are forever, and warehousing all known 
obsolete names for control codes is a bit bizarre. I think you and I are 
possibly in agreement on that.




If there are other mappings ...


I've replied on the issue of mappings in reply to Doug's message.

A./



Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-27 Thread Asmus Freytag

On 8/26/2011 10:09 PM, Philippe Verdy wrote:

2011/8/27 Asmus Freytag <asm...@ix.netcom.com>:

I agree with Ken that Phillipe's suggestion of conflating the annotations
for mathematical use with formal Unicode name aliases is a non-starter.

Yes, but why then add ISO 6429 alias names? What makes ISO 6429 a
better choice than another ISO standard, which you want to reject as a
non-starter option in the normative UCS namespace?


Because, had the control codes been treated like normal characters, 
they would have been named for their 6429 counterparts. For these 
characters, not having any formal identifiers in the standard created 
the problem that these aliases are now trying to fix.


And why drop some naming rules for some of the proposed alias names,
if this namespace also has normative rules? If you want consistency,
those aliases could as well be informative only, and not part of the
UCS namespace, avoiding some of its restrictions, i.e. not defined in
the UCD itself but in a separate database.


I think the naming rules issue is a bug, and it needs to be dealt with 
in revising the draft. I've already answered that, but you haven't yet 
seen the answer come through on the list.


And you did not reply to the question about the stability of the
related standard using these aliases, compared to the stability
requirement for the UCS namespace: if there's no such stability, the
normative reference in the UCD will remain only informative for the
other standard, creating possible future conflicts.


This is no different than for character names derived from other 
standards. If those standards subsequently change designators, too bad. 
You misconstrue the issue slightly: Unicode would not make a normative 
reference; it would copy, once, a particular name and use it as an alias.


A./







Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-27 Thread Asmus Freytag

On 8/26/2011 7:52 PM, Benjamin M Scarborough wrote:

Are name aliases exempted from the normal character naming conventions? I ask 
because four of the entries have words that begin with numbers.

008E;SINGLE-SHIFT 2;control
008F;SINGLE-SHIFT 3;control
0091;PRIVATE USE 1;control
0092;PRIVATE USE 2;control


This is a good point.

While the restriction may seem silly, it matters because the character 
name matching rules disregard both hyphens and spaces.


Under the rules, the following strings are all equivalent:

SINGLE-SHIFT1
SINGLE-SHIFT 1
SINGLE-SHIFT-1
SINGLE SHIFT 1
SINGLE SHIFT-1


(as would the dozens of permutations that introduced hyphens / spaces at 
other positions)


Given those matching rules, if the formal alias were to be

SINGLE-SHIFT-1

any programming environment could still recognize the name SINGLE-SHIFT 
1, since underneath, all would match to


SINGLESHIFT1
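
A minimal sketch of that "underneath" fold - a simplification of the 
UAX #44 loose-matching rule, which more precisely ignores case, 
whitespace, underscores, and medial hyphens, with one Hangul exception:

    def fold_name(name):
        # Simplified loose match: ignore case, spaces, hyphens, underscores.
        return "".join(c for c in name.upper() if c not in " -_")

    variants = ["SINGLE-SHIFT1", "SINGLE-SHIFT 1", "SINGLE-SHIFT-1",
                "SINGLE SHIFT 1", "SINGLE SHIFT-1"]
    assert {fold_name(v) for v in variants} == {"SINGLESHIFT1"}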

Perhaps the formal aliases in the draft file should be corrected simply 
to follow the established naming conventions, without introducing 
yet another level of exception.


A./



Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-27 Thread Asmus Freytag

On 8/27/2011 1:31 AM, Andrew West wrote:

On 27 August 2011 09:25, Andrew West <andrewcw...@gmail.com> wrote:

On 27 August 2011 03:52, Benjamin M Scarborough
<benjamin.scarboro...@utdallas.edu> wrote:

Are name aliases exempted from the normal character naming conventions? I ask 
because four of the entries have words that begin with numbers.

008E;SINGLE-SHIFT 2;control
008F;SINGLE-SHIFT 3;control
0091;PRIVATE USE 1;control
0092;PRIVATE USE 2;control


ISO 6429 (and consequently ISO/IEC 10646 Section 11) calls these characters:
SINGLE-SHIFT TWO
SINGLE-SHIFT THREE
PRIVATE USE ONE
PRIVATE USE TWO

http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-048.pdf

Changing their names to SINGLE-SHIFT 2 or SINGLE-SHIFT-2 etc. is
surely contrary to the whole point of the exercise.

Sorry, ignore that. I hadn't noticed that the digit forms were in
addition to the forms with numbers written as words.


Actually, you brought something to my attention that I had missed on 
reading the file, so I won't ignore this.


Having these ill-formatted names *in addition* to essentially the same 
name, but one that follows the naming conventions, strikes me as silly. 
It would set a potential precedent for adding aliases for any character 
name containing either a digit or the name for that digit. The PRI 
gives no rationale for the inclusion of names valid in earlier versions.


If there's a known deviation that is currently supported (as named 
character ID, such as in regular expressions) in widely distributed 
software, I would support the addition on compatibility grounds (with 
tweaks that follow the naming rules). But simply because a name existed 
once (but was later deprecated) strikes me as going into the same 
encyclopedic direction that Ken himself has disavowed.


I do think now that grouping the file is a bad idea, because several 
people in this discussion, myself included, missed these particular near 
duplicates. The natural thing is to want to know all names/aliases for a 
character. If someone needs grouping for some purpose, a spreadsheet or 
other tool can easily be used to filter by the status field.


I also think that the status field iso6429 is badly named. It should 
be control, and what is named control should be control-alternate, 
or perhaps, both of these groups should become simply control. I think 
the labels chosen by the data file just set up bad precedents. If 6429, 
why not a section for 9535 (or whatever the kbd standard is) etc.


A./



Re: PRI #202: Extensions to NameAliases.txt for Unicode 6.1.0

2011-08-26 Thread Asmus Freytag
I agree with Ken that Phillipe's suggestion of conflating the 
annotations for mathematical use with formal Unicode name aliases is a 
non-starter. The former exist to help mathematicians identify symbols in 
Unicode, when they know their name from entity lists. The latter are 
designed to allow programmers to support identifiers that match existing 
usage -- mainly for characters for which there currently is not any well 
defined ID, or for characters for which their abbreviated name is their 
de-facto name.


In a limited number of cases, that would lead to multiple aliases for 
the same character. The ideal is, as always, to have a single identifier 
per character, where possible. In a few exceptional cases, allowing 
alternate IDs via the NameAlias technique is of such overwhelming 
practical use as to support an exception.


Aliases come from the same namespace as character names, and must be 
unique, so that they can be used to unambiguously identify a character. 
They are intended to be used in programmatic interfaces, for example 
regular expressions. Adding redundant identifiers comes at a cost: all 
implementations have to rev their name tables, and using recently added 
aliases might not be portable until all implementations have caught up. 
That's why proposals to add additional aliases to any *existing* 
character should have to pass a really high bar. (I find the rationale 
for this initial expansion well thought out and defensible - leaving 
the control codes unnamed in 10646 has proven problematic for implementers.)


There's no strict limit to *informative* aliases for characters, nor is 
there a uniqueness requirement. If there are important real world 
designations under which certain characters are known, they could be 
documented with informative aliases. These informative aliases are then 
available to user interface designers who wish to support a "search for 
character by name" feature. Unlike the case for program source code, 
such interfaces can handle multiple hits for the same name - by 
presenting a list, for example.


Ultimately, even in this case, some annotations are better presented in 
special-purpose files than as informative records in the nameslist. That 
was done for mathematics.
established conventions for naming symbols, perhaps someone could 
provide an analogous list - but it should have no bearing on the PRI 
under consideration.


A./



Re: Code pages and Unicode

2011-08-25 Thread Asmus Freytag

On 8/24/2011 7:45 PM, Richard Wordingham wrote:


Which earlier coding system supported Welsh?  (I'm thinking of 'W WITH
CIRCUMFLEX', U+0174 and U+0175.)  How was the use of the canonical
decompositions incompatible with the character encodings of legacy
systems?  Latin-1 has the same codes as ISO-8859-1, but that's as far
as having the same codes goes. Was the use of combining jamo
incompatible with legacy Hangul encodings?


See how time flies.

Early adopters were interested in 1:1 transcoding, using a single 
256-entry table for an 8-bit character set, with guaranteed predictable 
length. Early designs of Unicode (and 10646) attempted to address these 
concerns, because ignoring them promised severe impediments to migration.
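
(That model is a one-liner per character set; a sketch, with the table 
built from Python's own codec machinery and KOI8-R chosen arbitrarily:)

    # One 256-entry table per 8-bit character set; the output is always
    # exactly as long as the input.
    TABLE = [ord(bytes([b]).decode("koi8_r")) for b in range(256)]

    def to_unicode(data):
        return "".join(chr(TABLE[b]) for b in data)

    assert to_unicode(b"\xc1") == "\u0430"  # 0xC1 -> CYRILLIC SMALL LETTER A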


Some characters were included as part of the merger, without the same 
rigorous process as is in force for characters today. At that time, 
scuttling the deal over a few characters here or there would not have 
been a reasonable action. So you will always find some exceptions to 
many of the principles - which doesn't make them less valid.


Obviously D800 D800 000E DC00 is non-conformant with current UTF-16. 
Remembering that there is a guarantee that there will be no more 
surrogate points, an extension form has to be non-conformant with 
current UTF-16! 


And that's the reason why there's no interest in this part of the 
discussion. Nobody will need an extension next Tuesday, or in a decade, 
or even in several decades - or ever. We haven't seen an upgrade to 
Morse code to handle Unicode recently, for example. Technology has a way 
of moving on.


So, the best thing is to drop this silly discussion and let those future 
people who might be facing a real *requirement* use their good judgment 
to come to a technical solution appropriate to their time - instead of 
wasting collective cycles of discussion on how to make 1990s technology 
work for an unknown future requirement. It's just bad engineering.

Everyone should know how to extend UTF-8 and UTF-32 to cover the 31-bit
range.


I disagree (as would anyone with a bit of long-term perspective). Nobody 
needs to look into this for decades, so let it rest.


A./



Re: Designing a format for research use of the PUA in a RTL mode (from Re: RTL PUA?)

2011-08-23 Thread Asmus Freytag

On 8/23/2011 7:22 AM, Doug Ewell wrote:

Of all applications, a word processor or DTP application would want to
know more about the properties of characters than just whether they are
RTL.  Line breaking, word breaking, and case mapping come to mind.

I would think the format used by standard UCD files, or the XML
equivalent, would be preferable to making one up:





The right answer would follow the XML format of the UCD.

That's the only format that allows all the necessary information to be 
contained in one file, and it would leverage any effort that users of 
the main UCD have made in parsing the XML format.


An XML format should also be flexible, in that you can add/remove not 
just characters but also properties, as needed.


The worst thing to do, other than designing something from scratch, 
would be to replicate the UnicodeData.txt layout with its random but 
fixed collection of properties and insanely many semicolons. None of 
the existing UCD txt files carries all the needed data in a single file.
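
For illustration, a fragment in the spirit of the UAX #42 XML format, 
carrying a hypothetical private agreement. The attribute names (cp, na, 
gc, bc) follow the UCD convention, though the real format adds a 
namespace and many more properties:

    import xml.etree.ElementTree as ET

    SNIPPET = """
    <repertoire>
      <char cp="E000" na="MY RTL LETTER ONE" gc="Lo" bc="R"/>
      <char cp="E001" na="MY RTL LETTER TWO" gc="Lo" bc="R"/>
    </repertoire>
    """

    for char in ET.fromstring(SNIPPET):
        print(char.get("cp"), char.get("bc"), char.get("na"))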


A./



Re: Code pages and Unicode

2011-08-23 Thread Asmus Freytag

On 8/23/2011 12:00 PM, Richard Wordingham wrote:

On Mon, 22 Aug 2011 16:18:56 -0700
Ken Whistler <k...@sybase.com> wrote:


How about Clause 12.5 of ISO/IEC 10646:

001B, 0025, 0040

You escape out of UTF-16 to ISO 2022, and then you can do whatever
the heck you want, including exchange and processing of complete
4-byte forms, with all the billions of characters folks seem to think
they need.
Of course you would have to convince implementers to honor the ISO
2022 escape sequence...

Which they only need to do if the text is in an ISO 2022 or similar
context.  Your idea does suggest that a pattern of
<high><high><SO><low> would be reasonable.


I don't see where Ken's reply (as quoted) suggests anything like that.

What he wrote is that, formally, 10646 supports a mechanism to switch to 
ISO 2022.


Therefore, formally, there's an escape hatch built in.

If and when such a thing should be needed, in a few hundred years, it'll 
be there. Until then, I find further speculation rather pointless and 
would love it if it moved off this list (until such time).


A./




Re: RTL PUA?

2011-08-22 Thread Asmus Freytag

On 8/21/2011 7:34 PM, Doug Ewell wrote:

So what you are asking about is a directional control character that would 
assign subsequent characters a BC of 'AL', right?

You don't want to call this a LANGUAGE MARK or anything else that implies language 
identification, because of the existence of real language identification 
mechanisms and the history of Unicode and language tagging.


An ARM (Arabic RTL Mark) would be a sensible addition to the standard. 
It would close a small gap in design that currently prevents a fully 
faithful plain text export of bidi text from rich text (higher level 
protocol) formats.


In an HLP you can assign any run to behave as if it were following a 
character with bidi property AL.


When you export this text as plain text, unless there is an actual AL 
character, you cannot get the same behavior (other than by the 
heavy-handed method of completely overriding the directionality, which 
makes your plain text less editable).


So, yes, there's a bit of a use case for such a mark.

(Its effect is limited to the treatment of numeric expressions, so it's 
not an "Arabic language mark", but one that triggers the same bidi 
context as the presence of an Arabic-script (AL) character.)
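
(A quick way to see the gap, using the character database: there are 
invisible marks with bidi classes L and R, but - as of this discussion - 
none with AL:)

    import unicodedata

    print(unicodedata.bidirectional("\u200E"))  # LEFT-TO-RIGHT MARK -> 'L'
    print(unicodedata.bidirectional("\u200F"))  # RIGHT-TO-LEFT MARK -> 'R'
    print(unicodedata.bidirectional("\u0627"))  # ARABIC LETTER ALEF -> 'AL'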


A./


--
Doug Ewell • d...@ewellic.org
Sent via BlackBerry by AT&T

-Original Message-
From: Richard Wordingham <richard.wording...@ntlworld.com>
Sender: unicode-bou...@unicode.org
Date: Mon, 22 Aug 2011 03:19:39
To: Unicode Mailing List <unicode@unicode.org>
Subject: Re: RTL PUA?

On Sun, 21 Aug 2011 23:55:46 +
Doug Ewell <d...@ewellic.org> wrote:


What's a LANGUAGE MARK?

There are *three* strong directionalities - 'L' left-to-right, 'AL'
right-to-left as in Arabic, 'R' right-to-left (as in Hebrew, I
suspect).  'AL' and 'R' have different effects on certain characters
next to digits - it's the mind-numbing part of the BiDi algorithm.
With one, a $ sign after a string of European (or is it Arabic?) digits
appears on the left, and with the other it appears on the right.  I
can't remember whether 'higher-level protocols' have an effect on this
logic. LRM has a BC of L, RLM has a BC of R, but no invisible character
has a BC of AL. That's why I tentatively raised the notion of ARABIC
LANGUAGE MARK.  Incidentally, an RLO gives characters a temporary
BC of R, not AL.

Richard.









Re: Implement BIDI algorithm by line

2011-08-22 Thread Asmus Freytag

Huh? What context is this in?

On 8/22/2011 11:18 AM, CE Whitehead wrote:

Hi.

I think many line breaks within paragraphs are soft line breaks but 
that embedding levels have to be taken into account when deciding the 
width of the glyphs; that's as near as I can tell.


Here is the description of the algorithm -- is this what you have read?
http://unicode.org/reports/tr9/
Some rules are in fact applied after the line wrapping (after the soft 
breaks):
The following rules describe the logical process of finding the 
correct display order. As opposed to resolution phases, these rules 
act on a per-line basis and are applied *after* any line wrapping is 
applied to the paragraph.

Logically there are the following steps:

  * The levels of the text are determined according to the previous rules.
  * The characters are shaped into glyphs according to their context
    (taking the embedding levels into account for mirroring).
  * The accumulated widths of those glyphs (in logical order) are
    used to determine line breaks.
  * For each line, rules L1–L4 (http://unicode.org/reports/tr9/#L1)
    are used to reorder the characters on that line.



(I'd have to reread the whole document on line breaking and then on bidi 
to answer this truly; sorry; hope this helps anyway.)

--C. E. Whitehead
cewcat...@hotmail.com




Re: RTL PUA?

2011-08-21 Thread Asmus Freytag

On 8/21/2011 3:31 PM, Richard Wordingham wrote:

On Sun, 21 Aug 2011 11:00:26 -0600
Doug Ewell <d...@ewellic.org> wrote:


I think as soon as we start talking about this many scenarios, we are
no longer talking about what the *default* bidi class of the PUA (or
some part of it) should be.  Instead, we are talking about being able
to specify private customizations, so that one can have 'AL' runs and
'ON' runs and so forth.

I was exploring the consequences to see if there was a one-size-fits-all
solution.  Someone (you?) suggested ON as a default, and I like
it.  I think it would also work fairly well for practical CJK
applications - the only problems are that LRM and RLM would
occasionally be needed, and the subtle differences between AL and R
would be lost.  I expect ARABIC LANGUAGE MARK would not go down well
- has it already been proposed and rejected?


If your implementation supported the directional overrides, it would be 
possible to use these to lay out any RTL text in a portable manner. Just 
enclose any RTL run with RLO and PDF (pop directional formatting).
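
A sketch of that enclosure (RLO = U+202E, PDF = U+202C; the PUA code 
points are, of course, just placeholders):

    RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
    PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

    def force_rtl(run):
        # Lay out every character in the run right-to-left, regardless of
        # the (default LTR) bidi class of PUA code points.
        return RLO + run + PDF

    line = "see " + force_rtl("\uE000\uE001\uE002") + " here"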


No impact on any existing implementation, no impact on the standard.

Those who produce rendering engines that do not support these overrides 
today could be leaned on to upgrade their implementations - that change 
would benefit users of non-PUA RTL languages as well (because sometimes, 
the bidi-algorithm can fail, such as for part numbers, and being able to 
use RLO is a simple way to stabilize such problematic text).


Treating PUA characters as ON is very problematic - their display would 
become context sensitive in unintended ways. No users of CJK characters 
would think of using LRM characters, but if text is inserted or viewed 
in RTL context, it could behave randomly.


In contrast, always supplying a RLO override for RTL text (containing 
PUA characters) would be a simple thing to remember and to get right.


A./




Re: RTL PUA?

2011-08-20 Thread Asmus Freytag

On 8/20/2011 6:44 PM, Doug Ewell wrote:

Would that really be a better default? I thought the main RTL needs for the PUA 
would be for unencoded scripts, not for even more Arabic letters. (How many 
more are there anyway?)

In any case, either 'R' or 'AL' as the Plane 16 default would be an improvement 
over having 'L' for the entire PUA.




The best default would be an explicit PU - undefined behavior in the 
absence of a private agreement.


However, it helps to remember why the PUAs exist to begin with. The 
demand came from East Asian character sets, which had long had such 
private use areas. In their case, the issue of properties did not 
seriously arise, because the vast bulk of private characters were 
ideographs.


I bet this remains true, and so the original motivation for the 
suggestion of L as the default would still apply - no matter how 
unsatisfactory this is from a formal point of view.


If maintaining the L default were to fail on the cliff of political 
correctness (or the fairness argument that has been made), the only 
proper solution is to use a value of unknown (i.e., the hypothetical PU 
value) for all private use code points.


There are some properties where stability guarantees prevent adding a 
new value. In that case, the documentation should point out that the 
intended effect was to have a PU value, but for historical / stability 
reasons, the tables contain a different entry.


Imposing a structure on the private use area, by suggesting 
different default properties, ipso facto makes the PUA less private. 
That should be a non-starter.


A./




Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Asmus Freytag

On 8/19/2011 2:35 PM, Jukka K. Korpela wrote:

20.8.2011 0:07, Doug Ewell wrote:


Of course, 2.1 billion characters is also overkill, but the advent of
UTF-16 was how we ended up with 17 planes.


And now we think that a little over a million is enough for everyone, 
just as they thought in the late 1980s that 16 bits is enough for 
everyone.




The difference is that these early plans were based on rigorously *not* 
encoding certain characters, or using combining methodology or variation 
selection much more aggressively. That might have been more feasible, 
except for the needs of migrating software and having Unicode-based 
systems play nicely in a world where character sets had different ideas 
of what constitutes a character.


Allowing thousands of characters for compatibility reasons, more than 
ten thousand precomposed characters, and many types of other characters 
and symbols not originally on the radar still has not inflated the 
numbers all that much. The count stands at roughly double that original 
goal, after over twenty years of steady accumulation.


Was the original concept of being able to shoehorn the world into 
sixteen bits overly aggressive? Probably, because the estimates had 
always been that there are about a quarter million written elements. 
If you took the current repertoire and used code-space-saving techniques 
in hindsight, you might be able to create something that fits into 
16 bits. But it would end up using strings for many things that are now 
single characters.


But the numbers, so far, show that this original estimate of a quarter 
million, rough as it was, appears to be rather accurate. Over twenty 
years of encoding characters have not been enough to exceed that.


The million code points are therefore a much more comfortable limit 
and, from the beginning, assume a ceiling that has ample head-room (as 
opposed to the "can we fit the world in this shoebox" approach of 
earlier designs).


So, no, the two cases are not really comparable.

A./




Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread Asmus Freytag

On 8/19/2011 3:24 PM, Ken Whistler wrote:

On 8/19/2011 2:07 PM, Doug Ewell wrote:

Technically, I think 10646 was always limited to 32,768 planes so that
one could always address a code point with a 32-bit signed integer (a
nod to the Java fans).


Well, yes, but it didn't really have anything to do with Java. 
Remember that Java
wasn't released until 1995, but the 10646 architecture dates back to 
circa 1986.


Yep.

So more likely it was a nod to C implementations which would, it was 
supposed,
have implemented the 2-, 3-, or 4-octet forms of 10646 with a wchar_t, 
and which
would have wanted a signed 32 bit type to work. I suspect, by the way, 
that that
limitation was probably originally brought to WG2 by the U.S. national 
body,

as they would have been the ones most worried about the C implementations
of 10646 multi-octet forms.


No, it was the Japanese NB, as represented by the individual from Toppan 
Printing.


This limitation was insisted upon in 1991, after the accord on the 
merger between

Unicode and 10646, when 10646 was changed to use a flat codespace, not the
ISO 2022-like scheme.



And the original architecture was also not really a full 32K planes in 
the sense

that we now understand planes for Unicode and 10646. The original design
for 10646 was for a 1- to 4-octet encoding, with all octets conforming 
to the

ISO 2022 specification. It used the option that the working sets for the
encoding octets would be the 94-unit ranges. So for G0: 0x21..0x7E and
for G1: 0xA1..0xFE. The other bytes C0, 0x20, 0x7F, C1, 0xA0, 0xFF, were
not used except for the single-octet form, as in 2022-conformant schemes
still used today for some East Asian character encodings.

And the octets were then designated G (group) P (plane) R (row) and C.

The 1-octet form thus allowed 95 + 96 = 191 code positions.

The 2-octet form thus allowed (94 + 94)^2 = 35,344 code positions

The 3-octet form thus allowed (94 + 94)^3 = 6,644,672 code positions

The Group octet was constrained to the low set of 94. (This is the origin
of the constraint to half the planes, which would keep wchar_t 
implementations

out of negative signed range.)

The 4-octet form thus allowed 94 * (94 +94)^3 = 624,599,168 code positions

The grand total for all possible forms was the sum of those values or:

*631,279,375* code positions

(before various *other* set-asides for plane swapping and private
use start getting taken into account)


This was so mind-bogglingly complicated that it was a deal breaker for 
many companies. Unicode's more restrictive concept of a character, its 
combining technology, and its many other innovations weren't initially 
seen as its primary benefits by people faced with evaluating the 
differences between the formal ISO-backed project and the de-facto 
industry collaboration forming around Apple and Xerox. But the flat code 
space, now you were talking.



Of course, 2.1 billion characters is also overkill, but the advent of
UTF-16 was how we ended up with 17 planes.


So a lot less than 2.1 billion characters. But I think Doug's point is 
still valid:

631 million plus code points was still overkill for the problem to
be addressed.

And I think that we can thank our lucky stars that it isn't *that* 
architecture for
a universal character encoding that we would now be implementing and 
debating on

the alternative universe version of this email list. ;-)


Even remembering it makes my head hurt.

A./



Re: What are the present criteria...

2011-08-18 Thread Asmus Freytag

On 8/18/2011 7:29 AM, Doug Ewell wrote:

Karl Pentzlin <karl dash pentzlin at acssoft dot de> wrote:


The quoted indicators for benefit were part of a concern of the German
NB regarding the Wingding/Webding proposals. The concern expressed in
WG2 N4085 is that some characters proposed there conform neither to
the policy statements by UTC or WG2, nor to the indicators of benefit
which the German NB would accept as an additional reason to encode
Wingding/Webding characters beyond the formal policies of UTC and WG2.

Nevertheless, N4085 is a German NB document, the criteria in question
are those suggested by the German NB and not WG2 (and the document makes
note of this distinction), and it is an error to portray this passage as
representing either a change or a lack of clarity in UTC or WG2 policy.


Karl makes no such claim. The document states that 2093-2096 appear to 
be in violation of the character-glyph model. I believe that's the 
section (or one of the sections) in the document that Karl summarizes 
here as "policy statements by UTC or WG2" - at least it would fit.


Anyway, it's more useful to focus on the actual concerns than on 
whether Karl summarized them correctly in his email.


The German NB introduces the concept of indicator of benefit [to] the 
user, and then defines that as:

- evidence of actual use
- evidence that it's likely a wrong character might be used for lack of 
an encoded character

- conformance to other standards
(I've slightly rephrased for clarity).

I have several problems with this approach.

First, these indicators are rather haphazardly compiled. Overwhelming 
evidence of plain-text use and conformance requirements are already 
recognized as valid reasons to encode characters (not just symbols). 
They do not, however, help in evaluating those proposals where more 
nuanced judgement is required. The remaining indicator, that the wrong 
character might be mistakenly used, is of overriding concern only in 
particular cases where questions of unification or disambiguation need 
to be decided.


Second, it's really unsatisfactory if each NB has its own criteria for 
when to add characters to the standard, and it's especially unsettling 
when such criteria seem to be applied ad hoc to a given repertoire. 
WG2 and Unicode have had lengthy discussions and broad consensus about 
the kinds of criteria to take into account when encoding characters in 
general or symbols in particular.


The result has been captured in a number of documents; for example, 
here's the original one from the UTC: 
http://unicode.org/pending/symbol-guidelines.html (with links to more 
recent versions).


Unlike the list in N4085, the criteria adopted by UTC and WG2 are not 
formulated as PASS / FAIL. Instead, they were carefully designed to be 
used in assigning weight in favor or in disfavor of encoding a 
particular symbol as a character. This recognizes an important 
principle, which has been notably absent in much recent discussion: it 
is generally not possible to create any set of criteria that can be 
applied mechanistically (or algorithmically). The decision to encode a 
character is and remains a judgement call. Some calls are easy, because 
the evidence is overwhelming and direct, some calls are more difficult, 
because the evidence may be uncertain or indirect, or the nature of the 
proposed character may not be as well understood as one would ideally 
prefer.


Recognizing these inherent difficulties in the encoding work, and the 
need for a set of weighing factors instead of simplistic PASS / FAIL 
criteria, was one of the early breakthroughs in the work of WG2 and UTC. 
Accordingly, the documents speak not of criteria for whether to encode 
characters, but of criteria that strengthen (resp. weaken) the case for 
encoding. That's a crucial difference.


While the details of these criteria (or factors) can and should be 
evaluated from time to time for continued appropriateness, the soundness 
of the general methodology is not in question, and UTC and WG2 should 
resist any attempts (directly or indirectly) to abandon them in favor of 
an unworkable, simplistic, and ad-hoc PASS / FAIL approach.


What are relevant criteria?

The document I cited lists the original set of criteria as follows


 What criteria strengthen the case for encoding?

   The symbol:

 * is typically used as part of computer applications (e.g. CAD
   symbols)
 * has well defined user community / usage
 * always occurs together with text or numbers (unit, currency,
   estimated)
 * must be searchable or indexable
 * is customarily used in tabular lists as shorthand for
   characteristics (e.g. check mark, maru etc.)
 * is part of a notational system
 * has well-defined semantics
 * has semantics that lend themselves to computer processing
 * completes a class of symbols already in the standard
 * is letterlike (i.e. should vary with the surrounding font style)


 

Re: Sanskrit nasalized L

2011-08-16 Thread Asmus Freytag

On 8/16/2011 1:57 AM, Andrew West wrote:

On 16 August 2011 02:59, Richard Wordingham
richard.wording...@ntlworld.com  wrote:

All I've got to go on is the penultimate sentence in TUS 6.0 Section
10.2 - 'Rarely, stacks are seen that contain more than one such
consonant-vowel combination in a vertical arrangement'.

http://www.unicode.org/versions/Unicode6.0.0/ch10.pdf#G30110

Which is followed immediately by the caveat:

These stacks are highly unusual and are considered beyond the scope
of plain text rendering. They may be handled by higher-level
mechanisms.


That's all well and good.


The question is: have any such mechanisms been defined and deployed by 
anyone?


A./


The Tibetan script doesn't have a combining virama.  I would expect the
natural coding to be something like letter-vowel-subjoined
letter-vowel, e.g. U+0F40 TIBETAN LETTER KA, U+0F74 TIBETAN VOWEL SIGN
U, U+0FB2 TIBETAN SUBJOINED LETTER RA, U+0F74 TIBETAN VOWEL SIGN U.

As the Unicode Standard explicitly states, non-standard stacks such as
this (which really are highly unusual, and only occur in a few
specific contexts) are outside the scope of plain text rendering, and
are not defined by the standard.  It therefore makes no sense for you
to try to specify character sequences for such non-standard stacks.

Andrew







Re: Non-standard Tibetan stacks (was Re: Sanskrit nasalized L)

2011-08-16 Thread Asmus Freytag

On 8/16/2011 3:32 PM, Andrew West wrote:

On 16 August 2011 18:19, Asmus Freytagasm...@ix.netcom.com  wrote:

These stacks are highly unusual and are considered beyond the scope
of plain text rendering. They may be handled by higher-level
mechanisms.

The question is: have any such mechanisms been defined and deployed by
anyone?

In my opinion, until someone produces a scan of a Tibetan text with
multiple consonant-vowel sequences and asks how they can represent it
in plain Unicode text, there is no question to be answered.


Thank you Andrew - that clarifies the issue for the non-specialist.

A./



Chris Fynn asked about certain non-standard stacks he was trying to
implement in the Tibetan Machine Uni font in an email to the Tibex
list on 2006-12-09, but these didn't involve multiple consonant-vowel
sequences (one stack sequence was 0F43 0FB1 0FB1 0FB2 0FB2 0F74 0F74
0F71, which would be reordered to 0F42 0FB7 0FB1 0FB1 0FB2 0FB2 0F71
0F74 0F74 by normalization, which would display differently).

Other non-standard stacks that I have seen involve horizontal
progression within the vertical stack (e.g. yang written horizontally
in a vertical stack).

More recently, the user community needed help digitizing Tibetan texts
that used the superfixed letters U+0F88 and U+0F89 within non-standard
stacks, resulting in a proposal to encode additional letters
(http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3568.pdf).

None of these non-standard stack use cases involved multiple
consonant-vowel sequences, and I'm not sure whether I have ever seen
an example of such a sequence.  I have learnt that there is little
point discussing a solution for a hypothetical problem, because when
the real problems arise they are likely to be something different.

Andrew






Re: Greek Characters Duplicated as Latin

2011-08-14 Thread Asmus Freytag

On 8/14/2011 1:39 PM, Richard Wordingham wrote:


U+00B5 MICRO SIGN is an ISO-8859-1 character, and was therefore
included as U+00B5.  It normally precedes a Latin-script letter, and
therefore it actually makes sense to treat it as a Latin-script
character, and possibly give it a different shape in these contexts to
the shape of the Greek letter in Greek text.


I don't think that there's a strong and overriding reason to give this 
character a separate shape.


As you note, the true reason that this character was encoded separately 
has to do with the requirement that the first 256 code points of Unicode 
should match 8859-1, so that simply widening a byte to 16 or 32 bits 
would transform 8859-1 data to UTF-16 or UTF-32. With the predominance 
of UTF-8 as the format for interchanging Unicode, something that wasn't 
foreseen from the beginning, this design criterion has lost some of its 
importance. However, it helped the migration to Unicode, by making 
conversion of the vast majority of data (at the time, ASCII and 8859-1 
accounted for the bulk of existing data on the net) dead simple.
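
(The widening really is that simple; a sketch:)

    # Code points U+0000..U+00FF coincide with 8859-1, so conversion is
    # just zero-extending each byte.
    data = bytes([0x48, 0xE9, 0xB5])          # 'H', e-acute, micro sign
    widened = "".join(chr(b) for b in data)
    assert widened == data.decode("latin-1") == "H\u00e9\u00b5"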


With anything as radically different from its predecessors as Unicode, 
keeping as much familiarity as possible was a major concern.


Now, once you list the small mu among the first 256 characters, you then 
have to ask the question what to do with the Greek alphabet. The basic 
alphabets are used in so many ways in software (for automatic numbering 
of headings, etc.) that disrupting this sequence (and leaving out the mu 
from the Greek alphabet) wasn't a realistic choice.


Hence, the duplication.

It does not alter the fact that the micro sign really is just a usage 
of the Greek small mu, and not actually a new entity.


Because the micro sign was widely implemented in systems and fonts that 
do not support the full set of Greek characters, I wouldn't be surprised 
to find that there are instances where the design was adjusted to make 
it fit better in a Latin environment. If so, these developments likely 
predate Unicode substantially, because this use of mu was supported in 
older technology as well. I recall seeing it on a (mechanical) 
typewriter keyboard.


I'm not sure I agree with the need to have a Latinized mu, but it 
exists and there you have it. Having two separate code points will allow 
these characters to have a separate development in the future.




U+2126 OHM SIGN is similar to U+00B5 MICRO SIGN, except that it is used
on its own.  Whether it should be merged with U+03A9 GREEK CAPITAL
LETTER OMEGA is debatable, but that is what has been done.


The Ohm sign should have been encoded as another example of squared 
letters and abbreviations. It comes from Asian character sets where, 
inexplicably, it exists separately from and alongside the capital 
Greek Omega - which they also encode.


In order to allow lossless conversion to/from these sets, there was a 
need to have a code point for the Ohm sign.


The Omega for Ohm was never as widely used as the mu, and it's 
questionable whether there really was much development of a different 
form for it. The Asian fonts that I knew in the '80s did not have 
different forms.


In modern usage, for new documents, this character should not be used.

A./



Re: Anything from the Symbol font to add along with W*dings?

2011-08-14 Thread Asmus Freytag

On 8/14/2011 12:51 PM, Jukka K. Korpela wrote:

14.8.2011 17:51, Doug Ewell wrote:


This sounds like Jukka expects browsers to analyze the glyph assigned in
the font to the code position for 'a' and decline to display it if it
doesn't look enough like an 'a' (rejecting, for example, Greek 'α'). I'm
not sure that is a reasonable expectation.


That wouldn’t be reasonable, but what I expect is that fonts have 
information about the characters that the glyphs are for and browsers 
use that information. Something like that is required for implementing 
the CSS font matching algorithm:

http://www.w3.org/TR/CSS2/fonts.html#algorithm



Not all documents are HTML or CSS.

Font overloading of this kind is common in many rich text documents 
and not limited to the Symbol font. Yes, it makes text non-portable in 
certain ways. Private use characters would have been a cleaner way to 
achieve the same non-portability. Windows will let you use private use 
characters to access symbol fonts (not just the symbol font), but this 
feature is not widely used (despite the fact that it dates back to the 
earliest days of Unicode support on that platform).


Why users voted with their feet (or keystrokes) is not a useful topic of 
speculation. The fact is, they did.


The question here is whether it's useful to add additional code points 
to allow plain-text coverage of certain widely spread fonts (of which 
the Symbol font is one), so that it's possible, for example, to use 
automated processes to re-encode font runs in older documents to make 
them more fully portable.
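
Such a re-encoding pass is mechanically trivial once a complete mapping 
exists. A sketch, with a tiny excerpt of the Symbol font's layout (the 
full table would run to a few hundred entries; positions taken from the 
Adobe Symbol encoding):

    # Map Symbol-font code positions to the Unicode characters the glyphs
    # actually denote; unmapped positions surface as U+FFFD for review.
    SYMBOL_TO_UNICODE = {
        0x61: "\u03B1",  # alpha
        0x62: "\u03B2",  # beta
        0x22: "\u2200",  # FOR ALL
        0x24: "\u2203",  # THERE EXISTS
    }

    def reencode_symbol_run(run):
        return "".join(SYMBOL_TO_UNICODE.get(b, "\uFFFD") for b in run)

    assert reencode_symbol_run(b"\x61\x62") == "\u03b1\u03b2"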


If there are indeed some characters missing to complete that goal, the 
numbers are small, and similar fragments of mathematical symbols have 
been encoded before. I would see no principled objection - only the 
question whether these are truly still unmapped. (However, I haven't 
researched these particular characters, so I'm not commenting on them 
in particular.)


A./


Re: ZWNBSP vs. WJ (was: How is NBH (U0083) Implemented?)

2011-08-05 Thread Asmus Freytag (w)
The ambiguity of an initial FEFF was not desirable, but this discussion shows 
that certain things can't be so easily fixed by adding characters at a later 
stage.
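
(The ambiguity in one example - the same initial code point is either a 
signature or content, depending on who reads it; in today's Python, for 
instance:)

    raw = "\uFEFFfoo".encode("utf-8")
    print(repr(raw.decode("utf-8")))      # '\ufefffoo': FEFF kept as ZWNBSP
    print(repr(raw.decode("utf-8-sig")))  # 'foo':      FEFF eaten as a BOM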

The more time elapses between the encoding of the ambiguous character and 
the later fix, the more software, the more data, and the more protocols 
exist that support the original character, creating backwards-compatibility 
issues.

Incidentally, this is exactly what I expected when the WJ was proposed, but 
sentiment in favor of its addition ran high at the time...

The ZWNBSP was present in Unicode 1.0 (1991), while the WJ was added in 3.2 
(2002), about 10 years later. We are now an additional 10 years down 
the road, and instead of the issue being clarified, the interim result 
is that WJ has muddied the waters.

Somewhere here are lessons to be learned.

A./


-Original Message-
From: Doug Ewell d...@ewellic.org
Sent: Aug 5, 2011 8:49 AM
To: unicode@unicode.org
Subject: Re: ZWNBSP vs. WJ (was: How is NBH (U0083) Implemented?)

Jukka K. Korpela jkorpela at cs dot tut dot fi wrote:

So? It was, and it still often is, better to use ISO 8859-1 rather
than Unicode, in situations where there is no tangible benefit, or just a
small benefit, from using Unicode. For example, many people are still
conservative about encodings in e-mail, for good reasons, so they use
ISO 8859-1 or, as you did in your message, windows-1252.

A word about my encoding choices.  My first message on Thursday was
sent from my home PC, using Windows Live Mail, and it used UTF-8 because
I configured Windows Live Mail to do so.  My second message was sent
from my mobile device, and used Windows-1252.  I don't know if there is
a way to tell the device to use UTF-8 for outgoing messages, but I can
say it was not my conscious intent to prefer Windows-1252 over Unicode.

This message is being sent via a Web interface; I guess we'll find out
what encoding it chooses for me.

 On the other hand, this isn’t comparable to ZWNBSP vs. WJ. These
 control characters do the same job in text, as per the standard, so
 the practical question is simply which one is better supported.

ZWNBSP, like WJ, is intended to inhibit breaking between words.  Despite
the other (and original) intended use of U+FEFF at the start of a text
as a byte-order mark, there is a pervasive belief that an initial U+FEFF
means the text should be treated as beginning with some kind of space
character.  This is silly, since there is no concept of between words
at the start of a text, but it is nevertheless the way people perceive
things.

WJ was introduced to encourage users to separate these two functions. 
If users don't adopt it, the problem will never be solved.  There are
enough issues in Unicode that cannot be fixed due to stability concerns;
it would be nice to be able to fix this one at least.

I still question how many real-world texts use either U+FEFF or U+2060
to achieve this non-breaking behavior.

 ISO 8859-1 and Unicode perform very different jobs, so that using ISO
 8859-1, you limit your character repertoire (at least as regards to
 directly representable characters, as opposite to various “escape
 notations”). If you don’t need anything outside the ISO 8859-1, the
 choice used to be very simple, though nowadays it has become a little
 more complicated (as e.g. Google Groups seems to munge ISO 8859-1 data
 in quotations but processes UTF-8 properly)

UTF-8 has the property of being easily detected and verified as such,
which solves part of the Google Groups problem (inability to detect
which SBCS is being used).  The other part of the problem is the
practice of using heuristics to override an explicit charset
declaration, but that is a topic for another day.

 I won’t make any statements about full compliance, but in Microsoft
 Office Word 2007, U+FEFF alias ZWNBSP does its basic job (inside text)
 in most situations whereas U+2060 alias WJ seems to be not recognized
at all and appears as some sort of a visible box. So to have the job
done, there is not much of a choice. (Word 2007 fails to honor ZWNBSP
 semantics after EN DASH, which is bad, but it does not make it useless
 in other situations.)

It does always come down to a complaint against Microsoft, doesn't it? 
Unfortunately, Yucca is right here: opening Word 2007 and pasting a
snippet of text with embedded ZWNBSP does display correctly, while the
same experiment with embedded WJ shows a .notdef box.  This seems to be
a font-coverage problem, amplified by Word's silent overriding of user
font choices—changing the font from the default Calibri to DejaVu Sans
(and optionally back to Calibri) makes the display problem go away, but
of course no user could reasonably be expected to go through that.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell









Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-17 Thread Asmus Freytag

On 7/17/2011 2:47 AM, Petr Tomasek wrote:

On Sun, Jul 17, 2011 at 10:14:55AM +0100, Julian Bradfield wrote:


Wouldn't it be more economical to encode a single UNICODE ESCAPE
CHARACTER which forces the following character to be interpreted as a
printable glyph rather than any control function?

I already thought about this but this would probably mean that
algorithms (like the Unicode BiDi Algorithm) would have to be changed.



Change that to: it would mean that ALL algorithms that interpret any of 
the invisible characters would have to change.


The reason is, of course, that these codes would *reinterpret* 
existing characters. You could argue that Variation Selectors do the 
same, but they are carefully constructed so that they can be safely 
ignored. These suggested characters couldn't be safely ignored, because 
doing so would leave control/formatting codes in the middle of text 
where none were intended.


Michael has it right:

On 7/17/2011 2:35 AM, Michael Everson wrote:


... invisible and stateful control characters are more expensive than ordinary 
graphic symbols.


In this case, the expense is so much higher as to rule out such an idea 
from the start.


A./

PS: this doesn't mean that adding graphic symbols is the foregone thing 
to do, only that, if evidence points to the need to address this issue 
in character encoding, then, using graphic symbols is the better way to 
go about it.




Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webdingproposal)

2011-07-17 Thread Asmus Freytag

On 7/17/2011 12:19 PM, Doug Ewell wrote:

Asmus wrote:


The reason is, of course, because these codes would *reinterpret* existing 
characters. You could argue that Variation Selectors do the same, but they are 
carefully constructed so that they can be safely ignored.



Variation selectors don't change the interpretation of characters, only their 
visual appearance.




The process of display is part of the more general concept of 
interpretation as this term is used in the Unicode Standard.


A./

PS: and variation selectors don't necessarily even change the visual 
appearance of a character. If the glyph shape for the given character in 
the selected font already matches or falls into the glyphic subspace 
indicated by the variation sequence, then you would not observe any 
change. (Ditto for display processes that don't support variation 
selectors, but that's a whole different kettle of fish).




Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-17 Thread Asmus Freytag

On 7/17/2011 12:19 PM, Philippe Verdy wrote:

2011/7/17 Asmus Freytag <asm...@ix.netcom.com>:

On 7/17/2011 2:35 AM, Michael Everson wrote:

... invisible and stateful control characters are more expensive than
ordinary graphic symbols.

In this case, the expense is so much higher as to rule out such an idea from
the start.

A./

PS: this doesn't mean that adding graphic symbols is the foregone
conclusion, only that, if evidence points to the need to address this
issue in character encoding, then using graphic symbols is the better
way to go about it.

Another alternative: instead of encoding separate symbols for each
control, we could as well encode symbols for each character visible in
those symbols.

E.g. to represent the glyph for the RLO control, we could encode three
characters, one for each of R, L, and O, as DOTTED SYMBOL FOR LATIN
CAPITAL LETTER R, DOTTED SYMBOL FOR LATIN CAPITAL LETTER L, DOTTED
SYMBOL FOR LATIN CAPITAL LETTER O. These three symbols would have a
representative glyph as the base letter from which they are derived,
within a dotted rectangle.

Then each of them would contextually adopt one of four glyph forms:
the full rectangle, the rectangle with the left or right side removed,
or the rectangle with both sides removed. The selection would be
performed contextually.


I'm baffled: what problem is this elaborate scheme trying to solve?

The problem was never in *how* to encode such symbols, but in *whether* 
they should be considered *characters* (and therefore need to be 
supported on the character level of the architecture). That point, 
whether there's a reasonable use case for them as characters, has not 
been settled, so the case for thinking about encoding solutions has not 
been established.


When people write about a line feed character, they use "LF" or 
"linefeed" or "000A" (or "U+000A" or "0x0A" etc.). They commonly don't 
use the SYMBOL FOR LINE FEED character, nor any other unencoded symbol.


I claim, the same is true for ZWJ, RLO, PDF and all the other good 
characters.


Just because Unicode uses dashed box placeholders in the code charts 
hasn't made them the generally accepted, universally understood 
*symbols* for these characters.


This is different from the pictures for control codes because, at the 
time those were encoded, they were widely supported in devices, and 
users of these devices (terminals) were familiar with the convention 
(staggered small letters) and many would recognize common control 
characters.


So, let's keep a lid on devising ever more arcane and fragile encoding 
and pseudo-encoding options until there's consensus that this issue must 
be addressed on the character level.


A./



Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webdingproposal)

2011-07-16 Thread Asmus Freytag

On 7/15/2011 10:48 PM, Doug Ewell wrote:

I apologize for the unintended content-free post. It's my phone's fault.

--



My dog ate the homework - 2011?

:)

A./



Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-16 Thread Asmus Freytag

On 7/16/2011 1:53 AM, Michael Everson wrote:

On 16 Jul 2011, at 04:37, Asmus Freytag wrote:


It's not a matter of competing views. There's a well-defined process for 
adding characters to the standard. It starts by documenting usage.

Yes, Asmus, and when one wants to do that, one writes a proposal. We aren't 
writing a proposal here. We're *talking* about things.


I fully understand the difference between making a formal proposal (that 
can be acted upon) and informally chatting about the possible needs for 
some characters - and the chances that a successful proposal might be 
written.


However, if the only hard information is assertions of personal 
preference such as "Sometimes I might want to show a dotted box for NBSP 
and sometimes a real NBSP", it is a bit much to then conclude "What I 
see is a certain unreasonability reflecting a certain conservatism" 
because there isn't an immediate, public enthusiasm for the idea.


A./

PS: My counter-assertion, that much of the technical literature uses the 
abbreviations in preference to dashed boxes, has been pointedly ignored 
by you. UAX#9, bidi, and UAX#14, linebreak, extensively discuss 
invisible characters - neither of these documents needs symbol 
characters, in fact, they would probably reduce clarity. This practice 
goes back over 15 years, so it can be seen as settled. (I further 
assert that I expect examples could be found outside the standard as well).


PPS: If anybody provides evidence (suitably documented for the level 
of discussion) of widespread use of symbolic depictions for certain 
invisible characters, I'd be quite open to review it and to base my 
future position on this new basis.





Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-16 Thread Asmus Freytag

Karl,

I've published similar surveys in the past, where the object was to 
get feedback on the desirability of further action. I stick by my 
recommendation in favor of keeping raw data out of the document 
registry and of doing the committee a favor by adding value in form of 
a sifting or analysis of such data.


Previewing the data is not the same as making a character encoding 
proposal, and there aren't any procedural rules for non-proposals, so 
there's nothing that prevents doing that. I have always provided some 
level of analysis, and I have not always chosen to register all such 
documents - for the reasons I gave you earlier.


The original rationale for encoding certain symbols had been their 
widespread use. The word widespread is key here. At the time that 
Unicode was first created, symbol sets associated with printers defined 
widespread use. After these sets were baked into the 2600 and 2700 
blocks, the phenomenal rise of Windows made the W/W-Dings sets even more 
widespread.


As you and WG2 evaluate additional such widely disseminated fonts, you 
will need to come up with your own criteria of what constitutes 
widespread. Those criteria should be applied both to the fonts 
considered as potential source of symbols, as well as to each category 
of symbols within these fonts.


I'll be interested in looking at a list of Apple symbols, once it's 
categorized a bit better by symbol function and / or gives a better idea 
of which (and how many) symbols extend existing sets (e.g. by adding 
directional variants) and which (and how many) might possibly be only 
variants of existing symbols - and similar information like that. 
(Unlike a full character encoding proposal I would not expect definite 
answers to these, but some tentative / approximate information would be 
nice).


A./



Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread Asmus Freytag

On 7/15/2011 1:08 AM, Karl Pentzlin wrote:

In WG2 N4085 Further proposed additions to ISO/IEC 10646 and comments to other 
proposals (2011‐
05‐25), the German NB had requested re WG2 N4022 Proposal to add Wingdings and 
Webdings
Symbols besides other points:
   Also, in doing this work, other fonts widespread on the computers of 
leading manufacturers (e.g.
   Apple) shall be included, thus avoiding the impression that Unicode or 
SC2/WG2 favor a single
   manufacturer.
In supporting this, there is now a quick survey of symbol fonts regularly 
delivered with computers
manufactured by Apple:
   http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4127.pdf

- Karl




 Karl,

I believe that publishing this document in its current form is more of 
a disservice than a service to the committees or the larger community (a 
few individuals excepted).


There appear to be a large number of symbols for which a Unicode 
equivalent can be identified with great certainty - and beyond that 
there seem to be characters for which such an assignment is perhaps more 
tentative, because of minor glyph differences, but still plausible.


I believe that only when these two passes have been carried out, will 
the document be of any reasonable use to wider audiences - as it is, 
everybody has to sift through all the characters, even the ones that are 
uninteresting (because their mappings are not in question, despite lack 
of glyph names).


Using Unibook, you can use the syntactic conventions of  canonical and 
compatibility decomposition listings to show mappings of which you are 
certain or which look OK, but need verification. Entirely questionable 
mappings could use the comment convention.


In the input file used by Unibook, a TAB, "=", SPACE at the start of a 
line, followed by a code point, can be used to show an "identically 
equal" sign with the mapping in the output. A TAB, "%", SPACE would show 
the "approximately equal" sign, and a TAB, "*", SPACE would yield a 
bullet (as for a comment).


Finally, you could use yellow and/or blue highlighting to mark 
characters needing particular levels of review.


Once you have carried the analysis to that stage, the document would 
indeed be of interest for wider reviewers. It would still not be a 
proposal, but you would have done the necessary legwork in *analyzing* 
(or tentatively analyzing) the repertoire.


A./


Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread Asmus Freytag

On 7/15/2011 9:03 AM, Doug Ewell wrote:

Andrew West <andrewcwest at gmail dot com> replied to Michael Everson:


I think that having encoded symbols for control characters (which we
already have for some of them) is no bad thing, and the argument
about too many characters is not compelling, as there are only some
dozens of these characters encoded, not thousands and thousands or
anything.

I oppose encoding graphic clones of non-graphic characters on
principle, not because of how many there are.

I agree with Michael about a lot of things, and this isn't going to be
one of them.  The main arguments I am seeing in favor of encoding are:

1. Graphic symbols for control characters are needed so writers can
write about the control characters themselves using plain text.


When users outside the character encoding community start reporting such 
a need in great numbers, it would indicate that there might (might!) be 
a real requirement. The character coding community has had decades to 
figure out ways to manage without this - and the current occasion 
(review of Apple's symbol fonts) is not a suitable context to suddenly 
drag in something that could have been addressed anytime for the last 20 
years, if it had been really urgent.


I don't think there's any end to where this can go.  As Martin said,
eventually you'd need a meta-meta-character to talk about the
meta-character, and then it's not just a size problem, but an
infinite-looping problem.


What real users need is to show hidden characters. That need can be 
served with different mechanisms. There seems to not be a consensus 
though, on what the preferred approach should be and implementations 
disagree. That kind of issue needs to be addressed differently, 
involving the cooperation of major implementers.




2. The precedent was established by the U+2400 block.

I thought those were compatibility characters, in the original sense:
encoded because they were part of some pre-existing standard.  That's
not necessarily a precedent in itself to encode more characters that are
similar in nature.


Doug is entirely correct. These are a precedent only if an extended set 
of other such symbols was found in use in some de-facto character set. 
In that special case, an argument for compatibility with *that* 
character set could be made. And for that to be successful, it would 
have to be shown that the character set is widely used and compatibility 
with it is of critical importance.


In addition, I claim, experience has shown that the control code 
image characters are not widely used. That means, any hope that the 
early encoders (and these go back to 1.0) may have had that those 
symbols are useful characters in their own right simply has not been 
borne out.




3. There aren't that many of them.

We regularly dismiss arguments of the form "But there's lots of room for
these in Unicode" when someone proposes to encode something that
shouldn't be there.  I don't see this as any different.


Correct.

The only time this argument is useful is in deciding between encoding 
the same character directly or as a character sequence. Using character 
sequences solely because of encoding space reasons, as opposed to the 
reason that the elements are characters in their own right, has become 
irrelevant due to the introduction of 16 more planes.


The same is true for excessive unification of certain symbols or 
punctuation characters: saving code space is not a valid argument here - 
so any decision needs to be based on other facts.


Michael is responsible for adding many thousands of characters to
Unicode, so it's awkward for me to be debating character-encoding
principles with him, but there we are.





Well, in this business, no-one's infallible.

A./



Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread Asmus Freytag

On 7/15/2011 2:23 AM, Karl Pentzlin wrote:

On Friday, 15 July 2011 at 10:58, Asmus Freytag wrote:

AF  ... There appear to be a large number of symbols for which a
AF  Unicode equivalent can be identified with great certainty -
AF  and beyond that there seem to be characters for which such
AF  an assignment is perhaps more tentative, because of minor
AF  glyph differences, but still plausible. ...
AF  ... Once you have carried the analysis to that stage ...

My intent was to present the data to people who want to continue the
work in this way, and to encourage the discussion of the Apple symbols
within the Wingding/Webding discussion in line with the German NB request
cited in my original mail.


You would serve this goal much better if, instead of rushing to simply 
add raw data to the document pile, you had narrowed the issue down by 
limiting this further to characters that need real scrutiny.



Such analysis as Asmus requested, done with the appropriate scrutiny
and thus requiring a considerable amount of time, is in fact the next
logical step in this work. This, however, does not necessarily have to
be done by myself.


So, essentially you are dumping it on everyone.

At this early stage (raw list) a better approach would have been to look 
for collaborators first and then collectively publish a document that 
provides useful analysis.


The document registry should be limited to documents that can and should 
be reviewed in committee. Raw data collections with little or no value 
added do not belong, in my view.


A./

PS: I feel strongly enough about this that I will not review the 
document in its current stage.




Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread Asmus Freytag

On 7/15/2011 10:26 AM, Michael Everson wrote:

What I see is a certain unreasonability reflecting a certain conservatism. Text 
about the Standard is important, and should be representable in an 
interchangeable way. Here { } is a Right to left override character. 
I want to talk about it in a way that is visible. Oops. I can't do it 
interchangeably.



Michael,

let me give you an example:

The Unicode Bidi Algorithm has extensive need to discuss this character, 
because it provides specification for its use and support by 
implementations. If you look at that document (UAX#9), you find this 
character discussed widely (and you can save that document to plain text 
without losing the sense of that discussion).


This example illustrates that we need to distinguish between the 
requirement to *discuss* characters and their use, and the perceived 
need to use *symbolic images* (glyphs) to do so. As the example of UAX#9 
shows, one does not follow from the other.


If there had been a universal requirement to use glyphs for this 
purpose, this requirement would have surfaced and could have been 
addressed anytime during the last 20 years. Another indication that this 
is not a universal requirement can be deduced from the fact that these 
glyphs do not show up in more font collections.


Several symbols for "space" or "blank" were added, however, because 
widespread use in documentation was attested. The same avenue should in 
principle be open for other such symbols (and here I disagree with 
Andrew and Martin): If widespread use of glyphic symbols (as opposed to 
abbreviations and names) can be documented for some characters, then 
those characters, and those characters only, should have whatever symbol 
is used to represent them added to the standard. Also, as in the example 
for SPACE, if there are different symbols, any of them that is 
widespread should be added - to unify symbols of different design based 
on the underlying concept that they represent would constitute improper 
unification, in my view.


So, there, I'm not at all unreasonable - I just reasonably ask that the 
normal procedures for adding characters are to be followed.


In this particular case, the Apple glyphs include glyphs for format 
characters that Unicode considers deprecated. Providing characters to 
encode glyphs for them would just be a waste. Further, while the glyphs 
shown match those from the Unicode code charts, they are not necessarily 
the shapes that are displayed when systems want to show these invisible 
characters - so users and documentation writers may need an entirely 
different set of glyphs. Finally, other vendors seem to not have 
endorsed these glyphs by including them in their font collections - much 
unlike the emoji, where multiple vendors had a large overlap of symbols, 
and with large overlap in glyphic representation as well.


Therefore, I strongly urge the committees to separate out these meta 
characters from the ongoing *symbol collection* review.
They can be taken up based on evidence of actual use (and showing the 
actual glyphs in such use) at a later occasion.


A./


Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread Asmus Freytag

On 7/15/2011 11:05 AM, Doug Ewell wrote:

What I see is a certain unreasonability reflecting a certain conservatism. Text 
about the Standard is important, and should be representable in an 
interchangeable way. Here { } is a Right to left override character. 
I want to talk about it in a way that is visible. Oops. I can't do it 
interchangeably.

[RTL] or {RTL} or Right-to-Left Override or U+202E might all
work.



The conventional abbreviations are:

RLO (Right-to-left override)
RLE (Right-to-left embedding)
RLM (Right-to-left mark)


--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell


Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread Asmus Freytag

On 7/15/2011 11:36 AM, Michael Everson wrote:

However, I agree with Asmus that in the context of the Wingdings-type symbols 
these characters should not be considered. They should be considered as a whole 
on their own.


Thank you Michael.

To reiterate and restate (so it can be read out of context):

   If widespread use of particular glyphic symbols for certain
   invisible characters (as opposed to abbreviations and names) can be
   documented, then those symbols, and those symbols only, are
   eligible to be added to the standard. As in the example for SPACE,
   if there are different such symbols denoting the same invisible
   character, any of them that is widespread could be added. Care
   should be taken not to unify symbols of different design merely
   based on the fact that they represent the same invisible character.


I simply ask that when and if these symbol characters are considered, 
the normal procedures for adding characters are to be followed. This 
includes adducing evidence of their use in documentation (other than the 
Unicode Standard itself) and similar publications. In particular, such 
documentation would need to be brought for each individual character 
(except perhaps for paired characters), as it is quite likely that some 
invisible characters are not documented extensively (for example, the 
deprecated ones).


Finally, it would be valuable if research into the use of such glyphic 
symbols was thorough enough to encompass a more or less complete range 
of glyphs used for each invisible character, not simply the Unicode 
chart glyph.


A./



Definition of character

2011-07-12 Thread Asmus Freytag

Jukka,

reminding everyone of the definition of technical term as opposed to a 
word in everyday language isn't helping address the underlying issue. 
Everyone is familiar with this distinction.


You note that there's a bit of a truism that underlies the definition of 
character and character encoding, but I would claim this is not limited 
to Unicode, and has nothing to do with promoting that standard. The 
truism goes like this: A character is what character encodings encode.


As such, character also becomes the smallest unit on which algorithms 
for processing textual data operate.


Historically, character encodings have also encoded, on otherwise equal 
footing, units that are intended for device control. Over time, some of 
the device control characters have been redefined as indicators of 
logical division of text. (TAB and LF are the most prominent examples of 
this evolution).


These historical developments have left us with this and other examples 
of deep ambiguities in the definition of the members of those sets we 
call character encodings. These ambiguities are reflected in the 
technical (as opposed to everyday) usage of the term character. I 
fully agree with Ken that you can't fix this situation by 
definitional fiat.


Let's look at the putative benefit of a better definition. I think such 
a benefit has implicitly been claimed to exist, but I would ask for a 
demonstration in this case.


One possible benefit of a solid definition of the members of a set is in 
helping decide which additional entities should be made members of the 
set. Can there be a definition of character that provides a solid 
guidepost for evaluating future proposed character additions to the 
standard?


Over twenty years of work on the Unicode Standard (and decades of work 
on earlier standards) have clearly demonstrated that it is impossible to 
devise an algorithm for deciding the question of what candidates are 
worthy for being encoded in Unicode (or any other character encoding).


The problem goes back to the incredible diversity of writing systems and 
notations and their use. It is further complicated by the fact that 
breaking down a writing system into elements (identifying the 
characters) can quite often be done in more than one way. In many 
instances it's not even obvious which method is the best in a given 
circumstance. Attempts to base this process on mechanistic rules (driven 
by definitions) are bound to fail.


Hence, characters are the outcome of a creative (human) process of 
analyzing writing systems. Once you have made a particular analysis, 
usually ending in an encoding, the elements thus defined are de facto 
the characters.


If you were to accept that it is impossible to rigorously define 
characters for purposes of making this analysis, the problem becomes 
simpler. Abstract characters are then entities encoded in one (or 
more) character encodings, and character is what character encodings 
encode. Operationally, characters are the smallest units operated on by 
algorithms that process textual data.


"Operated on" would sidestep the distinctions between characters that 
represent elements of a writing system like "A" and what Unicode calls 
format controls like RLM (or the segmentation characters like PS, 
LF, TAB).


A bit is not the smallest unit, because the algorithms (as logically 
described) don't operate on bits, they are defined in terms of 
characters (or sequences of characters).


For a fuller definition you might need to make clear that display is 
covered by "process", and you might need to find a way to cover the 
traditional use of control characters. They could be described as the 
smallest units operated on by algorithms that control devices 
displaying text, based on data embedded in the text stream.


While there might be some improvement in rewording the glossary entries 
in this way, doing so neither removes the inherent tautology nor does it 
eliminate the fact that characters are very diverse in what they represent.


But it might make clear that no definition of character will ever be 
sufficient to serve as input to the process of deciding the question of 
whether a proposed new entity is or isn't a character.


A./



Re: Unicode 7.0 goals and ++

2011-07-11 Thread Asmus Freytag

On 7/11/2011 11:57 AM, Ken Whistler wrote:

On 7/10/2011 4:58 PM, Ernest van den Boogaard wrote:

For the long term, I suggest Unicode should aim for this:


That kind of terminological purity isn't going to occur.

...


The Unicode Consortium has a glossary of terms:


...



But the Unicode Standard is neither a software system nor a protocol 
stack,

so trying to apply models appropriate to other realms probably isn't going
to get too far.
...



This much is *already* available. ...


...
Unicode 9.0 should claim: Processes will be defined and published in 
*UML* 2.0 (for lack of an open standard)
(Background: think UAX #9 Bidi written in a universal -graphic- 
language).


This, on the other hand, is not going to happen. I don't see the 
UTC going for that at all.


--Ken



I might have the numbering wrong, or even the sequence. But not the 
main line, is it?


Essentially, as Ken points out, this is not the trajectory that one 
would look forward to.


So I would think you're off about what you call the main line.

Not so coincidentally, I fully agree with his conclusions, as well as 
with the reasoning behind them.


A./




Re: Proposed Update UAXes for Unicode 6.1

2011-07-07 Thread Asmus Freytag

On 7/7/2011 8:42 PM, Karl Williamson wrote:

On 07/07/2011 02:33 PM, announceme...@unicode.org wrote:

Proposed updates for most Unicode Standard Annexes for Version 6.1 of
the Unicode Standard have been posted for public review.


Many of the documents appear to have no current modifications to 
review other than placeholders for future changes.




That means they are proposed to remain unaltered, and / or that actual 
proposed changes might still be in the works.


In either case, if you have an issue with the spec as written, now would 
be a good time to provide input that can lead to an improvement, 
correction or extension of the document and / or the corresponding data 
files.


From watching this process over the years, requests for incompatible 
changes that aren't a correction of out-and-out errors will have a tough 
time in committee. Editorial clarifications usually have the best 
success at acceptance, as well as any issues related to improving the 
handling of newly encoded characters. Anything else will be in-between, 
subject to some cost-benefit analysis by the UTC, weighing the putative 
benefits of a change to the cost of not only making it in the documents, 
but also to existing implementations.


A./



Re: unicode Digest V12 #108

2011-07-06 Thread Asmus Freytag

On 7/3/2011 6:31 AM, Philippe Verdy wrote:

Regarding the previous comment about the Danish "aa",


Sorry, most of that discussion missed the mark.

Modern Danish can have "AA" for two reasons. Accidental occurrence, as 
in "dataanalyse", which is composed of two words that just happen to 
put two A's together. The other is frozen spellings for names and the 
like. In the former case, you can never use "å"; in the latter case, you 
may not want to.


In the former case, you do not want to sort "AA" as if it were "å"; in 
the latter case, you do.


None of that has anything to do with ASCII - it's a question of 
orthographic practices, not of legacy encoding.


Because accidental digraphs (in Danish) happen at word boundaries in a 
compound, the SHY is an elegant way to mark them.


A./


Re: unicode Digest V12 #108

2011-07-06 Thread Asmus Freytag

On 7/6/2011 12:16 AM, Jukka K. Korpela wrote:
Allowing word division just to say that some characters do not 
constitute a digraph (or trigraph…) is not practical e.g. when the 
text has otherwise no word divisions, for one reason or another, or 
when the particular word division point is typographically suboptimal 
or even bad. 

I quite agree. But that's been my position from the start.

In my very first post in this thread I had written:

   ...*if* such split [=word division] *is possible*, I would call it
   [=SHY] the preferred solution to indicating an accidental digraph.

The corollary is that it's not a good thing to use SHY when there's no 
coinciding word division.


True digraphs are usually not word division points, but in any language 
forming compounds, accidental combinations occur at word-division 
boundaries with some frequency.


The Danes, over a decade ago, when they made the official recommendation 
to use SHY, appear to have come to the conclusion that "AA" can never 
occur accidentally, except at word division in compounds.


A./


Re: unicode Digest V12 #108

2011-07-02 Thread Asmus Freytag

On 7/2/2011 8:59 AM, Philippe Verdy wrote:

2011/7/2 Andrew Miller <a.j.mil...@bcs.org.uk>:

The "ng" in "Llangollen" is not the digram "ng" but two separate letters
(unlike the "ll" in the name, which is the digram).

Why not simply use a soft hyphen between the n and the g in this case?
Soft hyphens are normally recognized as such by smart correctors and
by search engines or collators. It seems enough to me to
indicate that this is not the Welsh digram "ng"; CGJ anyway is
certainly not the correct disjoiner in your case.



This solution works well if the word can split between the n and the g.

In fact, if such a split is possible, I would call it the preferred 
solution to indicating an accidental digraph.


An example:

The Danish digraph "aa", normally spelled "å" in modern orthography but 
retained in names etc., can occur accidentally in compound nouns, such 
as "dataanalyse". Adding a SHY is the preferred method to indicate that 
the "aa" is accidental.
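In encoded form the recommendation amounts to nothing more than a
U+00AD between the two halves of the compound; a minimal C++11 sketch
(the variable name is mine, nothing official):

#include <string>

// "dataanalyse" with U+00AD SOFT HYPHEN at the word boundary between
// "data" and "analyse": hyphenation, collation and spell checking can
// then tell that the two A's are accidental neighbors, not the digraph.
const std::u16string compound = u"data\u00ADanalyse";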


Other characters may have the same effect of breaking the digraph, but 
their use might require an *additional* SHY to be inserted, if and when a 
linebreak opportunity needs to be manually marked (say, for an unusual 
compound not recognized by the automatic hyphenator). It would be bad to 
have to have *two* invisible characters at that location.





Re: Typo in bidi reference implementation

2011-07-01 Thread Asmus Freytag

On 7/1/2011 12:06 AM, Peter Krefting wrote:

Hi!

On line 65 of 
http://www.unicode.org/Public/PROGRAMS/BidiReferenceCpp/bidi.cpp 
(version 26) the word "utility" is spelled as "uitlity" (line 80 has 
the correct spelling).


Not that it matters much, just something we noticed.

If it's in a comment, and easily corrected by the reader, I'd lean 
towards not touching the file. Definitely not something for which one 
would want to release (and test) a new version.


But we could fix the sources so that *if* there's ever a new version, it 
won't repeat the same issue.

A./


Re: Latin IPA letter a

2011-06-28 Thread Asmus Freytag

On 6/28/2011 1:51 AM, Michael Everson wrote:

On 28 Jun 2011, at 09:28, Jean-François Colson wrote:


In Times New Roman, which is the default font for MS Word (probably the best 
known word processor), the letters “a” and “ɑ” are indistinguishable in italics.

That is a fault of the font.


No, the font does what it's supposed to, which is to give the correct 
rendering of the letter 'a' for use in ordinary text. The problem is in 
Unicode's unification of the generic letter a with the IPA letter 'a', 
which has a restricted glyph range, and, as we now find out, must be 
treated differently when styles are applied.


Encoding a new character is not the answer. However, encoding a 
standardized variation sequence would be the proper answer. Insisting 
that people have control over the font with which text is viewed is to a 
degree illusory. Not recognizing that fact is a weakness in Unicode's 
1980s-based design in this instance.


A standardized variation sequence makes the IPA nature of the IPA 'a' 
more portable, while at the same time cleanly allowing text processing 
software to treat it like the ordinary 'a', when needed, by simply 
ignoring the variation selector.
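As a sketch of what "simply ignoring the variation selector" means for
an implementation (the function name is mine; the ranges are the
standard variation selector blocks U+FE00..U+FE0F and U+E0100..U+E01EF,
leaving out the Mongolian FVS range for brevity):

#include <string>

// Remove variation selectors so that processes which do not care about
// the glyphic restriction operate on the base characters alone.
std::u32string strip_variation_selectors(const std::u32string& in) {
    std::u32string out;
    out.reserve(in.size());
    for (char32_t c : in) {
        bool vs = (c >= 0xFE00 && c <= 0xFE0F)        // VS1..VS16
               || (c >= 0xE0100 && c <= 0xE01EF);     // VS17..VS256
        if (!vs) out.push_back(c);
    }
    return out;
}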


Why this can be addressed for Han ideographs to the n-th degree, but the 
few egregious instances of required glyphic subset restrictions can't be 
made portable for Latin escapes me totally.


Time for Unicode to be brought into the 21st century in that respect.

A./





Re: Unifon

2011-06-28 Thread Asmus Freytag

On 6/28/2011 1:40 AM, Andreas Stötzner wrote:


Am 28.06.2011 um 09:43 schrieb Jean-François Colson:

I’m interested in Unifon (http://www.unifon.org). That’s a phonemic 
alphabet for English which is used to teach reading.
Although it has been encoded in the ConScript Unicode Registry as a 
new script in a three-columns block, it has in fact been designed as 
an extension of the Latin alphabet.
Therefore, considering that three fifths of its letters are already 
available, I wonder whether a proposal shouldn’t be limited to the 16 
missing letters.

What’s your opinion?



Is there a real need for regular encoding?
If proposed as a kind of extension to Latin, there will be at least one
issue to be considered carefully: Unifon does not fit the Latin
writing system since it is unicameral, not bicameral (as far as I can
see).


The same restriction applies to IPA and phonetic notations, all of which 
have been unified with Latin as far as common letters are concerned.
By which I certainly do not intend to encourage any of the 
enthusiasts to think they ought now to go to their desks and try to 
invent new lowercase glyphs.





More relevant would be who uses this system, where and how widely.

The answer to those questions decides, among other things, whether any 
standardization effort is warranted.


A./


Re: UNICODE version of _T(x) macro

2010-11-23 Thread Asmus Freytag

On 11/23/2010 1:58 AM, sowmya satyanarayana wrote:



This is what I am actually looking for. My ODBC application supports 
UTF-16, which uses 2-byte-wide characters. This application is 
completely oriented around the _T(x) macro, as Asmus Freytag figured out.


Yeah, it's nice when you can do without, but if your code is filled with 
_T() macros for function arguments or static initializers, you've got to 
find a way to make it work.


In 2003 there was an attempt to introduce u"x" to mean "treat x as 
UTF-16" (and U"x" to mean "treat x as UTF-32").


With these extensions beyond L"x", you can write a _T() macro that is 
precisely UTF-16.


I don't know whether any recent compilers, especially in the Unix world 
have taken up that convention, but it's worth a try to check out whether 
that solves your problem. At the same time there was an attempt to 
introduce char16_t and char32_t with guaranteed size and support as 
UTF-16 and UTF-32. If your compiler supports these, then it may support 
u and U for initializers.
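If the compiler does accept that convention, the macro itself is
straightforward - a sketch, assuming the u prefix and char16_t that were
eventually standardized in C11/C++11 (whether a given compiler accepts
them is exactly the open question above):

#ifdef UNICODE
typedef char16_t TCHAR;       /* guaranteed 16-bit code units */
#define _T(x) u##x            /* u"..." is a UTF-16 literal   */
#else
typedef char TCHAR;
#define _T(x) x
#endif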


Otherwise, I'm afraid you may be stuck with your solution - but the 
problem is that you introduce temporary allocations and have memory 
lifetime issues. I think your sample code would leak memory.


In C++ you can define a simple object that's useful for wrapping static 
strings that are used as function arguments - the object will live just 
as long as needed (i.e. until the function returns). For other strings, 
your objects would have to be of global scope. But it's a pain nevertheless.
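A sketch of such a wrapper (the class name is mine; the widening loop is
only valid for ASCII content and stands in for a real conversion, purely
to keep the example short):

#include <string>

// Wraps a narrow literal and presents it as UTF-16 for the duration of
// the full expression in which it appears - long enough for a function
// call that takes a const char16_t*.
class U16Arg {
public:
    explicit U16Arg(const char* s) {
        for (; *s; ++s)
            buf_.push_back(static_cast<char16_t>(
                static_cast<unsigned char>(*s)));
    }
    operator const char16_t*() const { return buf_.c_str(); }
private:
    std::u16string buf_;
};

// usage: some_odbc_call(U16Arg("SELECT * FROM t"));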


A./



Re: Are Latin and Cyrillic essentially the same script?

2010-11-22 Thread Asmus Freytag

On 11/22/2010 4:15 AM, Michael Everson wrote:

It boils down to this: just as there aren’t technical or usability reasons that 
make it problematic to represent IPA text using two Greek characters in an 
otherwise-Latin system,

Yes there are. Sorting multilingual text including Greek and IPA 
transcriptions, for one. The glyph shape for IPA beta is practically unknown in 
Greek. Latin capital Chi is not the same as Greek capital chi.


  so also there are no technical or usability reasons I’m aware of why it is 
problematic to represent this historic Janalif orthography using two Cyrillic 
characters.

They are the same technical and usability reasons which led to the 
disunification of Cyrillic Ԛ and Ԝ from Latin Q and W.


The sorting problem I think I understand.

Because scripts are kept together in sorting, when you have a mixed 
script list, you normally override just the sorting for the script to 
which the (sort-)language belongs. A mixed French-Russian list would use 
French ordering for the Latin characters, but the Russian words would 
all appear together (and be sorted according to some generic sort order 
for Cyrillic characters - except that for a bilingual list, sorting the 
Cyrillic according to Russian rules might also make sense).


Same for a French-Greek list. The Greek characters will be together and 
sorted either by a generic Greek (script) sort, or a specific Greek 
(language) sort.When you sort a mixed list of IPA and Greek, the beta 
and chi will now sort with the Latin characters, in whatever sort order 
applies for IPA. That means the order of all Greek words in the list 
will get messed up. It will neither be a generic Greek (script) sort, 
nor a specific Greek (language) sort, because you can't tailor the same 
characters two different ways in the same sort.


That's the problem I understand is behind the issue with the Kurdish Q 
and W, and with the character pair proposed for disunification for Janalif.


Perhaps, it seems, there are some technical problems that would make the 
support for such mixed-script orthographies not as seamless as for 
regular orthographies after all.


In that case, a decision would boil down to whether these technical 
issues are significant enough (given the usage).


In other words, it becomes a cost-benefit analysis. Duplication of 
characters (except where their glyphs have acquired a different 
appearance in the other context) always has a cost in added 
confusability. Users can select the wrong character accidentally, 
spoofers can do so intentionally to try to cause harm. But Unicode was 
never just a list of distinct glyphs, so duplication between Latin and 
Greek, or Latin and Cyrillic is already widespread, especially among the 
capitals.


Unlike what Michael claims for IPA, the Janalif characters don't seem to 
have a very different appearance, so there would not be any technical or 
usability issue there. Minor glyph variations can be handled by standard 
technologies, like OpenType, as long as the overall appearance remains 
legible should language binding of a text have gotten lost.


That seems to be true for IPA as well - because already, if you lose the 
font binding for IPA, your a's and g's will not come out right, which 
means you don't even have to worry about betas and chis.


IPA being a notation, I would not be surprised to learn that mixed lists 
with both IPA and other terms are a rare thing. But for Janalif it would 
seem that mixed Janalif/Cyrillic lists would be rather common, relative 
to the size of the corpus, even if it's a dead (or currently out of use) 
orthography.


I'd like to see this addressed a bit more in detail by those who support 
the decision to keep the borrowed characters unified.


A./


Re: UNICODE version of _T(x) macro

2010-11-22 Thread Asmus Freytag

On 11/22/2010 10:18 AM, Phillips, Addison wrote:

sowmya satyanarayanasowmya underscore satyanarayana at yahoo dot
com
wrote:


Taking this, what is the best way to define the _T(x) macro for the
UNICODE version, so that my strings will always be 2-byte-wide
characters?

Unicode characters aren't always 2 bytes wide.  Characters with values
of U+10000 and greater take two UTF-16 code units, and are thus 4 bytes
wide in UTF-16.


Not exactly. The code units for UTF-16 are always 16 bits wide. Supplementary 
characters (those with code points >= U+10000) use a surrogate pair, which is 
two 16-bit code units. Most processing and string traversal is in terms of the 
16-bit code units, with a special case for the surrogate pairs.

It is very useful when discussing Unicode character encoding forms to 
distinguish between characters (code points) and their in-memory 
representation (code units), rather than using non-specific terminology 
such as "character".

If you want to use UTF-32, which uses 32-bit code units, one per code point, 
you can use a 32-bit data type instead. Those are always 4 bytes wide.


The question is relevant to the C and C++ languages.

What is asked: which native data type do I use to make sure I end up 
with a 16-bit code unit.


The usual way a _T macro is used is

TCHAR c = _T('x');
TCHAR * s = _T("x");

that is to wrap a string or character literal so that it can be used 
either as Unicode literal or as non-Unicode literal, depending on 
whether some global compile time flat (usually UNICODE or _UNICODE) is 
set or not.


The usual way a _T macro is defined is something like:

#ifdef UNICODE
#define _T(x) L##x
#else
#define _T(x) x
#endif

That definition relies on the compiler to support L'x' or L"string" by 
using UTF-16.


A few years ago, there was a proposal to amend the C standard to have a 
way to ensure that this is the case in a cross platform way. I can't 
recall offhand what became of it.


A./



Re: UNICODE version of _T(x) macro

2010-11-22 Thread Asmus Freytag

On 11/22/2010 11:08 AM, Asmus Freytag wrote:
depending on whether some global compile time flat (usually UNICODE or 
_UNICODE) is set or not. 

recte: flag.


Re: Are Latin and Cyrillic essentially the same script?

2010-11-19 Thread Asmus Freytag

On 11/18/2010 11:15 PM, Peter Constable wrote:

If you'd like a precedent, here's one:


Yes, I think discussion of precedents is important - it leads to the 
formulation of encoding principles that can then (hopefully) result in 
more consistency in future encoding efforts.


Let me add the caveat that I fully understand that character encoding 
doesn't work by applying cook-book style recipes, and that principles 
are better phrased as criteria for weighing a decision rather than as 
formulaic rules.


With these caveats, then:

  IPA is a widely-used system of transcription based primarily on the Latin 
script. In comparison to the Janalif orthography in question, there is far more 
existing data. Also, whereas that Janalif orthography is no longer in active 
use--hence there are not new texts to be represented (there are at best only 
new citations of existing texts), IPA as a writing system is in active use with 
new texts being created daily; thus, the body of digitized data for IPA is 
growing much more than is data in the Janalif orthography. And while IPA is 
primarily based on Latin script, not all of its characters are Latin 
characters: bilabial and interdental fricative phonemes are represented using 
Greek letters beta and theta.


IPA has other characteristics in both its usage and its encoding that 
you need to consider to make the comparison valid.


First, IPA requires specialized fonts because it relies on glyphic 
distinctions that fonts not designed for IPA use will not guarantee. 
(Latin a with and without hook, g with hook vs. two stories are just two 
examples). It's also a notational system that requires specific training 
in its use, and it is caseless - in distinction to ordinary Latin script.


While several orthographies have been based on IPA, my understanding is 
that some of them saw the encoding of additional characters to make them 
work as orthographies.


Finally, IPA, like other phonetic notations, uses distinctions between 
letter forms on the character level that would almost always be 
relegated to styling in ordinary text.


Because of these special aspects of IPA, I would class it in its own 
category of writing systems which makes it less useful as a precedent 
against which to evaluate general Latin-based orthographies.



Given a precedent of a widely-used Latin writing system for which it is 
considered adequate to have characters of central importance represented using 
letters from a different script, Greek, it would seem reasonable if someone 
made the case that it's adequate to represent an historic Latin orthography 
using Cyrillic soft sign.


I think the question can and should be asked, what is adequate for a 
historic orthography. (I don't know anything about the particulars of 
Janalif, beyond what I read here, so for now, I accept your 
categorization of it as if it were fact).


The precedent for historic orthographies is a bit uneven in Unicode. 
Some scripts have extensive collection of characters (even duplicates or 
near duplicates) to cover historic usage. Other historic orthographies 
cannot be fully represented without markup. And some are now better 
supported than at the beginning because the encoding has plugged certain 
gaps.


A helpful precedent in this case would be that of another minority or 
historic orthography, or historic minority orthography for which the use 
of Greek or Cyrillic characters with Latin was deemed acceptable. I 
don't think Janalif is totally unique (although the others may not be 
dead). I'm thinking of the Latin OU that was encoded based on a Greek 
ligature, and the perennial question of the Kurdish Q and W (Latin 
borrowings into Cyrillic - I believe these are now 051A and 051C). 
Again, these may be for living orthographies.


   /Against this backdrop, it would help if WG2 (and UTC) could point
   to agreed upon criteria that spell out what circumstances should
   favor, and what circumstances should disfavor, formal encoding of
   borrowed characters, in the LGC script family or in the general case./


That's the main point I'm trying to make here. I think it is not enough 
to somehow arrive at a decision for one orthography, but it is necessary 
for the encoding committees to grab hold of the reasoning behind that 
decision and work out how to apply consistent reasoning like that in 
future cases.


This may still feel a little bit unsatisfactory for those whose proposal 
is thus becoming the test-case to settle a body of encoding principles, 
but to that I say, there's been ample precedent for doing it that way in 
Unicode and 10646.


So let me ask these questions:

   A. What are the encoding principles that follow from the disposition
   of the Janalif proposal?

   B. What precedents are these based on resp. what precedents are
   consciously established by this decision?


A./




Re: Are Latin and Cyrillic essentially the same script?

2010-11-18 Thread Asmus Freytag

On 11/18/2010 8:04 AM, Peter Constable wrote:

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of André Szabolcs Szelp


AFAIR the reservations of WG2 concerning the encoding of Jangalif
Latin Ь/ь as a new character were not in view of Cyrillic Ь/ь, but
rather in view of its potential identity with the tone sign mentioned
by you as well. It is a Latin letter adapted from the Cyrillic soft sign,

There's another possible point of view: that it's a Cyrillic character that, 
for a short period, people tried using as a Latin character but that never 
stuck, and that it's completely adequate to represent Janalif text in that 
orthography using the Cyrillic soft sign.




When one language borrows a word from another, there are several stages 
of foreignness, ranging from treating the foreign word as a short 
quotation in the original language to treating it as essentially fully 
native.


Now words are very complex in behavior and usage compared to characters. 
You can examine pronunciation, spelling and adaptation to the host 
grammar to determine which stage of adaptation a word has reached.


When a script borrows a letter from another, you are essentially limited 
in what evidence you can use to document objectively whether the 
borrowing has crossed over the script boundary and the character has 
become native.


With typographically closely related scripts, getting tell-tale 
typographical evidence is very difficult. After all, these scripts 
started out from the same root.


So, you need some other criteria.

You could individually compare orthographies and decide which ones are 
important enough (or established enough) to warrant support. Or you 
could try to distinguish between orthographies for general use within 
the given language, vs. other systems of writing (transcriptions, say).


But whatever you do, you should be consistent and take account of 
existing precedent.


There are a number of characters encoded as nominally Latin in Unicode 
that are borrowings from other scripts, usually Greek.


A discussion of the current issue should include explicit explanation of 
why these precedents apply or do not apply, and, in the latter case, why 
some precedents may be regarded as examples of past mistakes.


By explicitly analyzing existing precedents, it should be possible to 
avoid the impression that the current discussion is focused on the 
relative merits of a particular orthography based on personal and 
possibly arbitrary opinions by the work group experts.


If it can be shown that all other cases where such borrowings were 
accepted into Unicode are based on orthographies that are more 
permanent, more widespread or both, or where other technical or 
typographical reasons prevailed that are absent here, then it would make 
any decision on the current request seem a lot less arbitrary.


I don't know where the right answer lies in the case of Janalif, or 
which point of view, in Peter's phrasing, would make the most sense, but 
having this discussion without clear understanding of the precedents 
will lead to inconsistent encoding.


A./



Re: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Asmus Freytag

On 11/15/2010 2:24 PM, Kenneth Whistler wrote:

FA47 is a compatibility character, and would have a compatibility mapping.

Faulty syllogism.


Formally a correct answer, but only because of something of a design flaw 
in Unicode. When the type of mapping was decided on, people didn't fully 
expect that NFC might become widely used/enforced, making these 
distinctions disappear wherever text is normalized in a distributed 
architecture.

FA47 is a CJK Compatibility character, which means it was encoded
for compatibility purposes -- in this case to cover the round-trip
mapping needed for JIS X 0213.

However, it has a *canonical* decomposition mapping to U+6F22.


And that, of course, destroys the desired round-trip behavior if it is 
inadvertently applied while the data are encoded in Unicode. Hence the 
need to recreate a solution to the issue of variant forms with a 
different mechanism, the ideographic variation sequence (and 
corresponding database).




The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.

Easily verified, for example, by checking the FA47 entry in
NormalizationTest.txt in the UCD.


While correct, it's something that remains a bit of a gotcha. Especially 
now that Unicode has charts that go to great lengths showing the 
different glyphs for these characters, I would suggest adding a note to 
the charts that makes clear that these distinctions are *removed* anytime 
the text is normalized, which, in a distributed architecture, may happen 
at any time.


A./

--Ken


When I type ... (U+FA47) into BabelPad, highlight it, and then
click the button labeled Normalize to NFC, the character
becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard
in this case? ...


Re: CJK Compatibility Gotchas (was: Re: Application that displays CJK text in Normalization Form D

2010-11-15 Thread Asmus Freytag

On 11/15/2010 5:43 PM, Kenneth Whistler wrote:

Perhaps someone would like to make a detailed proposal to
the UTC for how to fix the text and charts?;-)


Ken,

having shown yourself the master of detail in your reply, I think you've 
appointed yourself.


A round of applause for Ken!

See how easy that was? :)

Cheers,

A./

PS: I had something pithy in mind that would work for the charts - I'll 
send that off to the guy who maintains the nameslist.


Re: Application that displays CJK text in Normalization Form D

2010-11-14 Thread Asmus Freytag

On 11/14/2010 12:57 PM, Doug Ewell wrote:

Jim Monty jim dot monty at yahoo dot com wrote:

Japanese kana (the J in CJK) and Korean syllables (the K in 
CJK) both have different normalization forms. What do ideographs 
have to do with anything? I didn't mention ideographs; you did.


The term CJK is often used to refer to those characters which are 
common to Chinese and Japanese and Korean, viz. the ideographic 
characters.


Doug,

you might want to talk to the author of UTN#14 then, because he seems to 
be using the term CJK text in a sense that I find indistinguishable 
from the way Jim did.


Any relation of yours?

:)

A./

PS: I too think that replacing the "CJK text" with "Katakana and Hangul", 
as a more specific choice, would have been an improvement - as written it 
makes the problem sound more open-ended than it is. But you guys are 
arguing about an e-mail subject line, of all things...




Re: Is there a term for strictly-just-this-encoding-and-not-really-that-encoding?

2010-11-10 Thread Asmus Freytag
If you want to get that point across to a general audience, you could 
use a more colloquial term, albeit one that itself derives from mathematics.


Text that can be completely expressed in ASCII fits into something 
(ASCII) that works as a lowest common denominator of a large number of 
character sets.


You could call it "lowest common denominator" text.

Since ASCII is the only set that exhibits such a lowest common 
denominator relationship with enough other sets to make it interesting, 
and since that relation is so well known, it's usually enough to just 
refer to it by name (ASCII) without needing a general term - except 
perhaps for general audiences that aren't very familiar with it.


In these kinds of discussions I find it invariably useful to mention that 
the copyright sign is not part of ASCII. (I suspect that it's the most 
common character that makes a text lose its lowest common denominator 
status.)
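The corresponding test is trivial - a sketch in C++ (the function name
is mine):

#include <string>

// True if every byte is in the ASCII range; such text also "is" valid
// ISO 8859-1, UTF-8, and most other ASCII-compatible encodings.  A
// single copyright sign (0xA9 in Latin-1, two bytes in UTF-8) makes
// this return false.
bool is_lowest_common_denominator(const std::string& bytes) {
    for (unsigned char b : bytes)
        if (b > 0x7F) return false;
    return true;
}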


A./





On 11/10/2010 11:41 AM, Jim Monty wrote:

Here's a peculiar question.

Is there a standard term to describe text that is in some subset CCS of another
CCS but, strictly speaking, is only really in the subset CCS because it doesn't
have any characters in it other than those represented in the smaller CCS?

(The fact that I struggled to phrase this question in a way that made my meaning
clear -- and failed -- is precisely my dilemma.)

Text that has in it only characters that are in the
ASCII character encoding is also in the ISO 8859-1 character encoding and the
UTF-8 character encoding form of the Unicode coded character set, right? I often
need to talk and write about text that has such multiple personalities, but I
invariably struggle to make my point clearly and succinctly. I wind up
describing the notion of it in awkwardly verbose detail.

So I'm left wondering if the character encoding cognoscenti have a special
utilitarian word for this, maybe one borrowed from mathematics (set theory).

Jim Monty



Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Asmus Freytag

On 11/4/2010 5:46 PM, Doug Ewell wrote:

Markus Scherer wrote:

While processing 16-bit Unicode text which is not assumed to be 
well-formed UTF-16, you can treat (decode) an unpaired surrogate as a 
mostly-inert surrogate code point. However, you cannot unambiguously 
encode a surrogate code point in 16-bit text (because you could not 
distinguish a sequence of lead+trail surrogate code points from one 
supplementary code point), and therefore it is not allowed to encode 
surrogate code points in any well-formed UTF-8/16/32. [All of this is 
discussed in The Unicode Standard, Chapter 3.]


I'm probably missing something here, but I don't agree that it's OK 
for a consumer of UTF-16 to accept an unpaired surrogate without 
throwing an error, or converting it to U+FFFD, or otherwise raising a 
fuss. Unpaired surrogates are ill-formed, and have to be caught and 
dealt with.




The question is whether you want every library that handles strings to 
perform the equivalent of a citizen's arrest, or whether you architect 
things so that the gatekeepers (border control) police the data stream.


During development, early and widespread error detection is helpful in 
debugging. After that, it's probably better to concentrate the handling 
of these errors in a few places, because that tends to improve your 
options for implementing successful error recovery.


Malformed data shouldn't get in and shouldn't get perpetuated, but in 
the general case, there should be a facility for repairing faulty 
data, wherever that is reasonably possible.


In the context of uppercasing a string, for example, repair is not a 
reasonable option; neither is rejecting the string at that point - it 
should have been rejected or repaired much earlier.
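
To make the repair option concrete, here is a minimal Python sketch 
(mine, not a reference implementation) of a gatekeeper that replaces 
unpaired surrogates in a sequence of UTF-16 code units with U+FFFD:

    # Repair a list of UTF-16 code units: well-formed pairs pass through,
    # unpaired surrogates become U+FFFD REPLACEMENT CHARACTER.
    REPLACEMENT = 0xFFFD

    def repair_utf16(units):
        out, i = [], 0
        while i < len(units):
            u = units[i]
            if 0xD800 <= u <= 0xDBFF:        # lead surrogate
                if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                    out += [u, units[i + 1]] # well-formed pair
                    i += 2
                    continue
                out.append(REPLACEMENT)      # unpaired lead
            elif 0xDC00 <= u <= 0xDFFF:
                out.append(REPLACEMENT)      # unpaired trail
            else:
                out.append(u)                # ordinary BMP code unit
            i += 1
        return out

    print(repair_utf16([0xD800, 0x0041]))    # [65533, 65]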


A./



Re: Utility to report and repair broken surrogate pairs in UTF-16 text

2010-11-05 Thread Asmus Freytag

On 11/5/2010 7:02 AM, Doug Ewell wrote:

Asmus Freytag asmusf at ix dot netcom dot com wrote:


I'm probably missing something here, but I don't agree that it's OK
for a consumer of UTF-16 to accept an unpaired surrogate without
throwing an error, or converting it to U+FFFD, or otherwise raising a
fuss. Unpaired surrogates are ill-formed, and have to be caught and
dealt with.

The question is whether you want every library that handles strings
perform the equivalent of a citizen's arrest, or whether you architect
things that the gatekeepers (border control) police the data stream.

If you can have upstream libraries check for unpaired surrogates at the
time they convert UTF-16 to Unicode code points, then your point is well
taken, because then the downstream libraries are no longer dealing with
UTF-16, but with code points.  Doing conversion and validation at
different stages isn't a great idea; that's how character encodings get
involved with security problems.


Note that I am careful not to suggest that (and I'm sure Markus isn't 
either). Handling includes much more than code conversion. It includes 
uppercasing, spell checking, sorting, searching, the whole lot. 
Burdening every single one of those tasks with policing the integrity of 
the encoding seems wasteful, and, as I tried to explain, puts the error 
detection in a place where you'll most likely be prevented from doing 
something useful in recovery.


Data import or code conversion routines are in a much better place, 
architecturally, to allow the user meaningful options for dealing with 
corrupted data, from rejecting it to attempting repair.


However, some tasks, such as network identifier matching, are 
security-sensitive and must re-validate their input, even if the data 
has already passed a gate keeper routine such as a validating code 
conversion routine.



Corrigendum #1 closed the door on interpretation of invalid UTF-8
sequences.  I'm not sure why the approach to handling UTF-16 should be
any different.






Re: A simpler definition of the Bidi Algorithm

2010-10-17 Thread Asmus Freytag

 On 10/17/2010 7:01 AM, Michael D. Adams wrote:

This is something that not even the C++ and Java reference
implementations do (though it appears that the C++ implementation of
the W rules was originally derived from a regular expression as it
uses state tables, but if so it is undocumented).  (Which by the way
they have not been proven to be equivalent, they have merely been
tested.  Proof is a much more complicated formalism.)
Having written the C++ reference implementation, I know a thing or two 
about it :)


The two implementations were not formally proven to be equivalent - in a 
way, that wasn't the purpose. The purpose was to see whether several (in 
this case two) implementers could read the rules and come up with the 
same results.


When someone creates a real-world implementation of a specification like 
the Bidi Algorithm, it's not usually verified by formal proof, but by 
testing. Therefore, the exercise had to do with finding out what level of 
testing was sufficient to capture inadvertent misapplication of some of 
the less-well-worded rules. (Some of them have since been rewritten to 
make their intent less ambiguous.)


Most of the issues were found with the test pass that simply compared 
all possible sequences up to length 6. That is better than the 
BidiTest.txt file, which I understand only goes to length 4. Stochastic 
sampling of sequences up to length 20 resulted in fewer reported 
discrepancies - again, all of this is from memory.


For the test, the maximal depth of embeddings was set to 15 instead of 
63, and the inputs were strings of bidi classes, not raw characters - 
therefore cutting down on the number of possible sequences.
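
The shape of such a comparison harness is simple; here is a sketch 
(mine; resolve_levels_a/b are stand-ins for the two implementations, 
and the class list is only the subset used for illustration):

    from itertools import product

    # Stand-ins: each maps a sequence of bidi classes to embedding levels.
    def resolve_levels_a(classes): ...
    def resolve_levels_b(classes): ...

    # A subset of the bidi classes (a real test also covers the explicit
    # embedding controls).
    BIDI_CLASSES = ["L", "R", "AL", "EN", "ES", "ET", "AN", "CS",
                    "NSM", "BN", "B", "S", "WS", "ON"]

    def compare_exhaustively(max_len=4):
        for n in range(1, max_len + 1):
            for seq in product(BIDI_CLASSES, repeat=n):
                if resolve_levels_a(seq) != resolve_levels_b(seq):
                    print("discrepancy:", seq)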


The Java implementation was deliberately designed to be transparent - 
matching the way the rules are formulated in an obvious way. For the C++ 
implementation I wanted to do something different, and possibly faster, 
so I hand-coded a few state tables. The biggest challenge was not in 
creating those tables, but in understanding the nuances of the rules, by 
the way.


The situation today is not fully comparable, since there was some 
feedback from the reference implementation project to the rules and 
especially their wording.


A./



Re: A simpler definition of the Bidi Algorithm

2010-10-17 Thread Asmus Freytag

 On 10/17/2010 10:59 AM, Michael D. Adams wrote:

The biggest challenge was not in creating those tables, but in
understanding the nuances of the rules, by the way.

Two questions so I can understand better.

First, by nuances do you mean the nuances of how the rules interact
(which I think would be simplified by using a definition as I have
proposed) or some other nuance?

Neither - as they evolved over time, the rules were revised to more 
clearly state how to handle certain edge cases and to remove language 
that could be (and had been) misinterpreted. In other words, the 
statement of the rules has improved. Now that we have a field-tested set 
of rules, it's of course easy to re-write them, because you can be 
certain to know what they mean.


Perhaps by going your route, we would have arrived at the same result. 
Who knows. That's the difference between theory and history. History 
takes one, and only one, of the possible paths to get to a result, and it 
doesn't care a bit whether that path was optimal.


If you'd been a contributor then, history might well have proceeded 
differently.


Cheers,

A./



Re: [unicode] Telugu Unicode Encoding Review

2010-10-16 Thread Asmus Freytag

 On 10/16/2010 10:38 AM, suzuki toshiya wrote:

Hi,

I've never heard any comments about the reservation of codepoints to 
make the code chart structure similar among multiple scripts, neither 
positive nor negative. So your comment is interesting. Could you tell me 
more about what kind of disadvantages you're thinking of?


The source for this arrangement is an Indian National Standard.

As chapter 9 of TUS states in the introduction:

   They are all encoded according to a common plan, so that comparable
   characters are in the same order and relative location. This structural
   arrangement, which facilitates transliteration to some degree, is based
   on the Indian national standard (ISCII).

The important thing to remember is that when Unicode was first created, 
it was seen as very important to mimic the layout of 8-bit character 
sets for a given script - at least for those scripts that had fairly 
well established standards in the 80s.


While this seems quaint now, it did make it easier for people to become 
comfortable with Unicode - and to be able to tell quickly and reliably 
whether important character sets were fully covered. Without that, 
Unicode might never have established itself - as unbelievable as that 
may sound to those who did not experience that transition period first hand.
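
As an aside, the common plan is concrete enough that a crude 
transliteration really can be done by offsetting code points - a toy 
sketch (mine; real transliteration needs more care, e.g. for characters 
that exist in only one of the blocks):

    # Devanagari and Telugu share the ISCII-derived layout, 0x300 apart.
    DEVANAGARI_BASE, TELUGU_BASE = 0x0900, 0x0C00

    def devanagari_to_telugu(text: str) -> str:
        return "".join(
            chr(ord(c) - DEVANAGARI_BASE + TELUGU_BASE)
            if 0x0901 <= ord(c) <= 0x094D else c
            for c in text)

    print(devanagari_to_telugu("\u0915\u092E\u0932"))  # कमल -> కమల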


A./




If Telugu users are using a 7-bit or 8-bit encoding and they want to use 
more codepoints for unencoded characters, the disadvantage (the 
reduction of the available codepoints) is clear. But... you're talking 
about Unicode.

Regards,
mpsuzuki

Kiran Kumar Chava wrote (2010/10/17 2:06):

Hi,


At the link, http://geek.chavakiran.com/archives/55 , I tried to 
understand the Telugu Unicode encoding and then tried to do an 
out-of-the-box review of it. Kindly let me know if I am missing 
something, and whether the things mentioned as missing in the above 
article are really missing or not. Any other views...


Thanks in advance,

Kiran Kumar Chava

http://chavakiran.com









Re: statistics

2010-10-12 Thread Asmus Freytag

 On 10/11/2010 9:49 PM, Janusz S. Bień wrote:

On Mon, 11 Oct 2010  announceme...@unicode.org wrote:


  The newly finalized Unicode Version 6.0 adds 2,088 characters,

What is the current total? Are other statistic informations available
somewhere?

The announcement gives a link to click through.

There you will find more statistics.

A./

Best regards

JSB






Re: Irrational numeric values in TUS

2010-10-12 Thread Asmus Freytag

 Ken,

some comments, and a few suggestions near the end.


On 10/12/2010 4:56 PM, Kenneth Whistler wrote:

Karl Williamson asked:


The Unicode standard only gives numeric values to rational numbers.  Is
the reason for this merely because of the difficulty of representing
irrational ones?

No. Primarily it is because the Unicode Standard is a *character*
encoding standard, and not a standard for numeric values for
various mathematical constants that some characters might be
used to represent.

Correct.



I consider EULER CONSTANT an unfortunate misnomer from the
very, very early days of the Unicode Standard. If we had it to
do over, particularly given the later addition of all the
styled mathematical alphanumerics, I would have favored:

2107 [insert stylename here] CAPITAL E
   = Euler constant

Or something similar -- just to make the point clearer.
Actually, what you advocate here is what I consider the mistake that was 
made with the WEIERSTRASS ELLIPTIC FUNCTION. The problem is that the 
Letterlike Symbols were conflated with styled letters used as symbols. 
They are not at all the same category. The Planck constant is a styled 
letter used as a symbol, and is correctly unified with the italic h, but 
the Planck constant divided by 2*pi, or h-bar, is not a styled letter but 
a symbol derived from a styled letter - a true letterlike symbol.


2107 and 2118 are one-off designs, not part of complete sets, same as 210F.

Because these characters came from not-well-understood legacy 
collections, and because styled letters used as symbols were initially 
deemed inadmissible to Unicode as complete sets, these distinctions 
weren't clear at the time.

  NamesList.txt
says that U+03C0, GREEK SMALL LETTER PI is used for the ratio of a
circle's circumference to its diameter, but it has other uses as well,
and does not have the Math property.

Having the Math property basically has nothing to do with whether
a character is assigned a Numeric_Value or not.


Correct.

The various Math PI's don't seem
that they necessarily mean this value either.  Things like the two
characters that have Planck's constant in their names, even if the
code points always meant that, have different values in different
measurement systems, so couldn't be said to refer to particular numbers.

I'm curious if any thought was given to this, and what code points I'm
missing in my analysis.

U+1D452 MATHEMATICAL ITALIC SMALL E (or merely U+0065 LATIN
SMALL LETTER E), also used for Euler's number. See also U+2147.


Now you are confusing Euler's constant - also depicted with U+03B3 GREEK 
SMALL LETTER GAMMA - with the base of the natural exponential. That kind 
of confusion is really not helpful, and is what drives people like Karl 
to ask for numeric property values in the first place - to unambiguously 
define what these symbols were encoded for.


The proper place to document that, without introducing a formal 
property, is with additional nameslist annotation for a few characters.


I suggest that you add the correct value for Euler's constant as a 
comment and cross-reference that character to 03B3


0.57721 56649 01532 86060 65120 90082 40243 10421 59335 93992

should be approximate enough...?

At the same time you could add a comment e ≈ 2.718 for 212F - again, not 
to document the value, but to make clear, beyond the character name, 
what constant the alias for 212F denotes.
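
(For illustration - mine, not from the thread: Python's unicodedata 
module exposes the Numeric_Value property, and shows that these 
characters deliberately have none.)

    import unicodedata

    # VULGAR FRACTION ONE HALF has a rational Numeric_Value:
    print(unicodedata.numeric("\u00BD"))         # 0.5

    # EULER CONSTANT and SCRIPT SMALL E have no Numeric_Value at all;
    # without a default, unicodedata.numeric would raise ValueError.
    print(unicodedata.numeric("\u2107", None))   # None
    print(unicodedata.numeric("\u212F", None))   # None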



For that matter, why stop with irrationals? There is
also U+1D456 MATHEMATICAL ITALIC SMALL I (or merely U+0069 LATIN
SMALL LETTER I), used for the imaginary number, square root
of -1. See also U+2148 and U+2149.

Basically, there is no end to how mathematicians may end up
assigning odder and more exotic kinds of numbers to various
symbols available in the standard. And I think how they do
so and exactly what those values mean is basically out of
scope of the Unicode Standard.



Correct - it's not Unicode's role to make the assignment, but common 
usage can and should be documented informally - that's no different from 
documenting modifier letters with detailed linguistic usage.


A./




Re: 00B7 vs. 2027

2010-09-18 Thread Asmus Freytag

 On 9/18/2010 8:36 AM, abysta wrote:

Hello.

I need a dot to separate words into syllables. What should I use, 00B7 or 2027, 
and why?



2027 is explicitly intended to be used to show syllables, as is done in 
dictionaries. You don't make it explicit in your query, but it sounds 
like that is the purpose you are looking for. So don't hesitate: use 2027.


The nice thing about 2027 is that you can always filter it back out, 
because, by intent, it is not part of the word.
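
A small illustration (mine) of that round trip:

    entry = "dic\u2027tio\u2027nar\u2027y"   # syllabified dictionary entry
    print(entry.replace("\u2027", ""))       # dictionary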


A./



Re: 00B7 vs. 2027

2010-09-18 Thread Asmus Freytag

 On 9/18/2010 10:56 AM, Lorna Priest wrote:




U+00B7 MIDDLE DOT is semantically ambiguous and has (partly 
therefore) varying renderings, and it might be used as a replacement 
for U+2027 if the latter cannot be used reliably.



What about using U+02D1 - half triangular colon?


Why not use the character that was added to Unicode precisely for the 
purpose?


A./



Re: A simpler definition of the Bidi Algorithm

2010-09-10 Thread Asmus Freytag
The first discussions that led to the current formulation of the bidi 
algorithm easily go back 20 years by now. There's some value in not 
re-stating a specification - even if a new formulation could be found to 
be 100% equivalent. That value lies in the fact that any reader can 
tell, by simple inspection, that the specification hasn't changed, and 
that implementations that claim conformance to earlier versions of the 
specification are indeed still conformant to later versions.


This point is particularly important for the bidi algorithm, because of 
its mandatory nature and the fact that it gets re-issued with a new 
version number every time the underlying Unicode Standard gets a 
new version (because of new characters added, etc.).


That does not preclude other, equivalent formulations of the algorithm, 
whether in textbooks or, perhaps, as a Technical Note. But the burden is 
on the creators of these other formulations to show that their 
supposedly easier or more didactic presentation is indeed equivalent.


Having said that, there are already two other formulations of the 
algorithm that have been shown to be equivalent to each other (and have 
not been shown to deviate from the written algorithm). I'm referring of 
course to the C++ (http://www.unicode.org/Public/PROGRAMS/BidiReferenceCpp/) 
and Java reference implementations.


A./

PS: Personally, I don't find the presentation in terms of the regular 
expressions any more intuitive than the original.





Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-06 Thread Asmus Freytag

On 8/6/2010 2:03 AM, William_J_G Overington wrote:

On Thursday, 5 August 2010, Kenneth Whistler k...@sybase.com wrote:
 
  

I am thinking of where a poet might specify an ending version of a glyph at the 
end of the last word on some lines, yet not on others, for poetic effect. I 
think that it would be good if one could specify that in plain text.
  
 
  

Why can't a poet find a poetic means of doing that, instead of depending on a 
standards organization to provide a standard means of doing so in plain text? 
Seems kind of anti-poetic to me. ;-)

 
  

--Ken

 
Well, I was just suggesting an example. I am not an expert on poetry.
  

What you mean are artistic or stylistic variants.

These have certain problems, see here for an explanation: 
http://www.unicode.org/forum/viewtopic.php?p=221#p221


A./
 
  





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-05 Thread Asmus Freytag

On 8/5/2010 3:47 AM, William_J_G Overington wrote:

On Wednesday 4 August 2010, Asmus Freytag asm...@ix.netcom.com wrote:
 
  

However, there's no need to add variation sequences to
select an *ambiguous* form. Those sequences should be
removed from the proposal.

 
Are you here talking about such things as alternate glyph styles?
  
No, I am referring to the element of the proposal that proposes to have 
a variation sequence that selects the unspecified form for lower case a.
 
It depends what one means by need.
  
I've written a longer answer here: 
http://www.unicode.org/forum/viewtopic.php?f=9t=83start=0


A./
 
  





Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters

2010-08-04 Thread Asmus Freytag

On 8/2/2010 5:04 PM, Karl Pentzlin wrote:

I have compiled a draft proposal:
Proposal to add Variation Sequences for Latin and Cyrillic letters
The draft can be downloaded at:
 http://www.pentzlin.com/Variation-Sequences-Latin-Cyrillic2.pdf (4.3 MB).
The final proposal is intended to be submitted for the next UTC
starting next Monday (August 9).

Any comments are welcome.

- Karl Pentzlin

  
This is an interesting proposal to deal with the glyph selection problem 
caused by the unification process inherent in character encoding.


When Unicode was first contemplated, the web did not exist and the 
expectation was that it would nearly always be possible to specify the 
font to be used for a given text and that selecting a font would give 
the correct glyph.


As the proposal noted, universal fonts and viewing documents on other 
platforms and systems across the web have made this solution 
unattractive for general texts.


We are left, then, with these five scenarios:

1) Free variation
2) Orthographic variation of isolated characters (by language, e.g. 
different capitals)
3) Orthographic variation of entire texts (e.g. italic Cyrillic forms, 
by language)

4) Orthographic variation by type style (e.g. Fraktur conventions)
5) Notational conventions (e.g. IPA)

For free variation of a glyph, the only possible solutions are either 
font selection or the use of a variation sequence. I concur with Karl 
that in this case, where notable variations have been unified, adding 
variation selectors is a much more viable means of controlling authorial 
intent than font selection.


If text is language-tagged, then OpenType mechanisms exist in principle 
to handle scenarios 2 and 3. For full texts in a certain language, using 
variation selectors throughout is unappealing as a solution.


However, it may be a viable solution for being able to embed correctly 
rendered citations in other text, given that language tagging can be 
separated from the document and that automatic language tagging may 
detect large chunks of text, but not short runs.


The Fraktur problem is one where one typestyle requires additional 
information (e.g. when to select long s) that is not required for 
rendering the same text in another typestyle. If it is indeed desirable 
(and possible) to create a correctly encoded string that can be rendered 
without further change automatically in both typestyles, then adding any 
necessary variation sequences to ensure that ability might be useful. 
However, that needs to be addressed in the context of a precise 
specification of how to encode texts so that they are dual renderable. 
Only addressing some isolated variation sequences makes no sense.


Notational conventions are addressed in Unicode by duplicate encoding 
(IPA) or by variation sequences. The scheme has holes, in that it is not 
possible in a few cases to select one of the variants explicitly; 
instead, the ambiguous form has to be used, in the hope that a font will 
be used that has the proper variant in place for the ambiguous form.


Adding a few variation sequences (like the one to allow the "a" at 0061 
to be the two-story one needed for IPA) would fill the gap for times 
when control over the precise display font is not available.
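
For illustration (mine; the particular sequence shown is hypothetical, 
not a registered one - registering such sequences is exactly what the 
proposal is about): in encoded text, a variation sequence is simply the 
base character followed by a variation selector:

    # Hypothetical: LATIN SMALL LETTER A + VARIATION SELECTOR-1.
    two_story_a = "\u0061\uFE00"
    print([hex(ord(c)) for c in two_story_a])  # ['0x61', '0xfe00']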


However, there's no need to add variation sequences to select an 
*ambiguous* form. Those sequences should be removed from the proposal.


Overall a valuable starting point for a necessary discussion.

A./



Re: Standard fallback characters (was: Draft Proposal to add Variation=D=A Sequences for Latin and Cyrillic letters)

2010-08-04 Thread Asmus Freytag

On 8/4/2010 1:30 PM, verdy_p wrote:

Asmus Freytag  wrote:
  
The Fraktur problem is one where one typestyle requires additional 
information (e.g. when to select long s) that is not required for 
rendering the same text in another typestyle. If it is indeed desirable 
(and possible) to create a correctly encoded string that can be rendered 
without further change automatically in both typestyles, then adding any 
necessary variation sequences to ensure that ability might be useful. 
However, that needs to be addressed in the context of a precise 
specification of how to encode texts so that they are dual renderable. 
Only addressing some isolated variation sequences makes no sense.



I don't think so.

If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but 
even in this case, the conversion to long s will be inappropriate. So use the Fraktur 
round s directly.
  
This statement makes clear that you don't understand the rules of 
typesetting text in Fraktur.
If a text in Fraktur absolutely requires the long s, it's only when the original text was already using this long s. 

This statement is also incorrect.

The rules for when to use the long s in Fraktur and when to use the 
round s depend on the position of the character within the word, in 
complicated ways.


The same word, typeset using Antiqua style will not usually have the long s.

For German, there exist a large number of texts that were typeset in 
both formats, so you can compare for yourself. Even in France, I suspect 
that research libraries would have editions of 19th century German 
classics in both formats.

In that case, encode the long s: the text will render with a long s both in modern Latin font styles like Bodoni 
(with a possible fallback to the modern round s if that font does not have a long s), and in classic Fraktur font 
styles (with, here also, a possible fallback to the Fraktur round s if the Fraktur font omits the long s from its repertoire of supported 
glyphs).
  
I'm skipping the rest of your message, because you've started from a 
wrong premise, and sorting out which bits still apply after accounting 
for that premise is not something I have the time, energy or inclination 
for.


Sorry,

A./
  





Re: Re:=D=A Standard fallback characters (was: Draft Proposal to add Variation� Sequences for Latin and Cyrillic letters)

2010-08-04 Thread Asmus Freytag

Philippe,

Text typeset in Fraktur contains more information than text typeset in 
Antiqua. That means there are some places where there are some (mild) 
ambiguities in representation in the Antiqua version - not enough to 
bother a human reader, who can use deep context to read the text 
correctly, but enough that a mere typesetting system cannot correctly 
render such a text in Fraktur.


I'm not currently aware of anything that would prevent an automated 
system from converting a text encoded for Fraktur to one encoded for 
Antiqua, because you are merely throwing away information.


So far we agree.

The question is whether it would be possible to make this process work 
by default in common, unmodified rendering engines, and whether that 
is desirable. (I don't treat either of these questions as settled one way 
or the other - so please don't attribute a position to me on that subject.)


What I do know is that there are historic documents using Antiqua fonts 
that do use the long s. Therefore, in principle, you don't necessarily 
want to create fonts that map long to round s. And, as an author, you 
can't rely on such a font being present on the reader's end - it might 
equally likely be one that does implement the long s.


So, whatever automatic rendering of Fraktur-ready text with non-Fraktur 
general-purpose fonts you have in mind should not rely on this kind of 
non-standard glyph substitution. That would be a terrible hack, 
imperiling the ability of people to use the long s outside the context 
of the Fraktur tradition.


All I had argued for was that Karl should take out the consideration of 
rendering text encoded for Fraktur from his proposal and make it part of 
a separate document that addresses ALL issues of this type of rendering, 
making it a complete specification - that would be something that allows 
review on its own merits.


A./





Re: Plain text

2010-07-29 Thread Asmus Freytag

On 7/28/2010 9:32 PM, Doug Ewell wrote:

Murray Sargent murrays at exchange dot microsoft dot com wrote:

It's worth remembering that plain text is a format that was 
introduced due to the limitations of early computers. Books have 
always been rendered with at least some degree of rich text. And due 
to the complexity of Unicode, even Unicode plain text often needs to 
be rendered with more than one font.


I disagree with this assessment of plain text.  When you consider the 
basic equivalence of the same text written in longhand by different 
people, typed on a typewriter, finger-painted by a child, 
spray-painted through a stencil, etc., it's clear that the sameness 
is an attribute of the underlying plain text.  None of these examples 
has anything to do with computers, old or new.
That may be, but the way Unicode plain text is designed is based on the 
concept of plain text in computers, and what that means was hashed out 
long before Unicode arrived on the scene. To a large measure, what 
Unicode did was extend that concept to additional writing systems (and 
to historic or rarely used nooks and crannies of some of the existing 
writing systems).


In the process, your definition of plain text was pulled out, dusted 
off, and used as a philosophical underpinning of the enterprise - but 
the technologists in the effort did not first discard any notions of 
computer-based plain text before proceeding. In other words, claiming a 
clean break between the existing ASCII plain text and Unicode would be 
a falsification.


I do agree that rich text has existed for a long time, possibly as 
long as plain text (though I doubt that, when you consider really 
early writing technologies like palm leaves), but I don't think that 
refutes the independent existence of plain text.  And I don't think 
the need to use more than one font to render some Unicode text implies 
it isn't plain text.  I think that has more to do with aesthetics (a 
rich-text concept) and technical limits on font size.
No, it's not headings and the like. If you pull together a selection of 
ordinary books in the English language and remove rich text attributes, 
you will find a considerable fraction of the works will exhibit subtle 
changes in meaning - these works require italics to mark emphasis in 
places where the same sequence of words can be read in different ways.


Scholarly works require italics for citations - absent italics, some 
other method would need to be introduced to mark titles; without any 
designation, there can and will be ambiguities.


Hence, not all texts can be expressed as plain text.

If you take a German text, rendered (by a human typesetter) in Fraktur 
and rendered (by a later typesetter) in Antiqua, you will find that the 
second version has less information in it, when you encode both texts on 
a computer. And many texts that can be represented as plain text if they 
are to be rendered in Antiqua cannot be plain text if they are to be 
rendered according to the rules of typesetting a work in the Fraktur 
style - again, we are talking ordinary running text, no headings, 
bibliographies or anything.


The additional information is not of an aesthetic or stylistic nature, 
but tied to the meaning of certain words - that which Unicode calls 
semantic.
In other words, the text, as rendered in Antiqua, allows for potential 
ambiguities - not necessarily fatal ones, because context may easily 
resolve them, but they are there, nevertheless.


This is just one example how the concept of an abstract content of a 
piece of text is not nearly as clearcut as you might think.


On the contrary, the definition of Unicode plain text is 
straightforward: a sequence of Unicode characters without any style 
information.


A./



Re: High dot/dot above punctuation?

2010-07-28 Thread Asmus Freytag

On 7/28/2010 2:02 AM, Kent Karlsson wrote:


Den 2010-07-28 09.50, skrev Jukka K. Korpela jkorp...@cs.tut.fi:

  

André Szabolcs Szelp wrote:



Generally, for the decimal point . (U+002E FULLSTOP) and , (U+002C
COMMA) is used in the SI world. However, earlier conventions could use
different notation, such as the common British raised dot which
centers with the lining digits (i.e. that would be U+00B7 MIDDLE DOT).
  

The different dot-like characters are quite a mess, but the case of British
raised dot is simple: it is regarded as typographic variant of FULL STOP.

Ref.: http://unicode.org/uni2book/ch06.pdf (second page, paragraph with
run-in heading Typographic variation).



And the Nameslist says:

002E    FULL STOP
    = period, dot, decimal point
    * may be rendered as a raised decimal point in old style numbers

However, I think that is a bad idea: firstly, the digits here aren't
necessarily old style (indeed, André wrote "lining", i.e. NOT
old style). And even if they are old style, it seems to me a
bad idea to make this a contextual rendering change for FULL STOP
(and it also says "may", not "shall", so there is no way of knowing
which rendering you will get even with old style digits).
Better to stay with the MIDDLE DOT for the raised decimal dot.
  
The real problem I have with this annotation is that it recommends a 
practice that I strongly suspect has never been implemented in the 
entire 20 years since it's been on the books. (If anyone knows of an 
implementation that has contextual rendering of FULL STOP, I'd like to 
learn about it here.)


If a particular text uses both raised periods and raised decimal points, 
then I see use in being able to use 002E for this and make it change by 
using a font with a different glyph. But if it applies only to the 
decimal point, overloading 002E would require a degree of context 
analysis that I believe is unimplemented (see above). If my suspicion is 
true, then, at the minimum, the annotation should be reworded so that it 
doesn't seem to imply a practice that doesn't exist.

Further, I don't see any major problem with using U+02D9 DOT ABOVE
for high dot in this case.
  
Me neither - if it's positioned right, then it should be used. 
Duplicating dots by function is definitely a no-no. However, unifying 
punctuation characters with definite differences in appearance only 
works well if these differences are systematically applied with a 
type-style (font) selection and then apply to the entire text in each 
font - such as the use of a double oblique glyph for HYPHEN (and 
HYPHEN-MINUS) in Fraktur fonts.


A./




Re: High dot/dot above punctuation?

2010-07-28 Thread Asmus Freytag

On 7/28/2010 10:09 AM, Murray Sargent wrote:
Contextual rendering is getting to be more common thanks to adoption of OpenType features. For example, both MS Publisher 2010 and MS Word 2010 support various contextually dependent OpenType features at the user's discretion. The choice of glyph for U+002E could be chosen according to an OpenType style. 
  
I know that the technology exists that (in principle) can overcome an 
early limitation of 1:1 relation between characters and glyphs in a 
single font. I also know that this technology has been implemented for 
certain (but not all) types of mappings that are not 1:1.

It's worth remembering that plain text is a format that was introduced due to 
the limitations of early computers. Books have always been rendered with at 
least some degree of rich text. And due to the complexity of Unicode, even 
Unicode plain text often needs to be rendered with more than one font.
  
However, the question I raised here is whether such mechanisms have been 
implemented to date for FULL STOP. Which implementation makes the 
required context analysis to determine whether 002E is part of a number 
during layout? If it does make this determination, which OpenType 
feature does it invoke? Which font supports this particular OpenType 
feature?


A./






Re: Reasonable to propose stability policy on numeric type = decimal

2010-07-28 Thread Asmus Freytag

On 7/28/2010 10:13 PM, Martin J. Dürst wrote:
Sequences of numeric Kanji are also used in names and word-plays, and 
as sequences of individual small numbers.


But the same applies to our digits. A very simple example is to use 
them as a ruler in plain text:


 1 2 3 4 5 6 7
1234567890123456789012345678901234567890123456789012345678901234567890


Didn't see this before I sent mine. Martin says it better.

A./




Re: Why does EULER CONSTANT not have math property and PLANCK CONSTANT does?

2010-07-27 Thread Asmus Freytag

On 7/27/2010 3:02 PM, Kenneth Whistler wrote:

Karl Williamson asked:

  

Subject: Why does EULER CONSTANT not have math property and PLANCK CONSTANT 
does?



  

They are U+2107 and U+210E respectively.



Because U+210E PLANCK CONSTANT is, to quote the standard,
simply a mathematical italic h. It serves as the filler for
the gap in the run of mathematical italic letters at U+1D455.
  

Correct - they form a set and need to be treated consistently.


Other letterlike symbols in that block are not given the
Other_Math property, even if they may be used in mathematical
expressions. (Note that regular Greek letters are also not
given the Other_Math property, even though they obviously also
occur in mathematical expressions.)
  
For Euler Constant and Weierstrass elliptic function, this doesn't make 
a lot of sense, as these are explicitly mathematical characters, not 
characters that are also used in mathematical expressions.


I have put in a formal proposal to add these two (2107 and 2118) to the 
list of characters with the math property.

The Math property can be thought of as a hint that a particular
symbol is specialized for mathematical usage; it isn't a
property that any character that ever occurs in a mathematical
expression needs to have. Nor is every character with
the Math property only used in mathematical contexts.
  
One way to look at this property is as a way to help with the detection 
of mathematical expressions in running text. Characters that are 
primarily used for mathematical purposes, or prominently used there, 
should be included. Characters that are heavily used in ordinary text, 
with non-mathematical uses, should be excluded.
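
For illustration (mine): the Math property is carried in the UCD file 
DerivedCoreProperties.txt, so a detector could load it and scan running 
text - a sketch, assuming that file is available locally:

    # Parse the Math property ranges from DerivedCoreProperties.txt.
    def load_math_ranges(path="DerivedCoreProperties.txt"):
        ranges = []
        for line in open(path, encoding="utf-8"):
            line = line.split("#")[0].strip()    # drop comments
            if "; Math" not in line:
                continue
            cps = line.split(";")[0].strip()
            lo, _, hi = cps.partition("..")
            ranges.append((int(lo, 16), int(hi or lo, 16)))
        return ranges

    def is_math(ch, ranges):
        cp = ord(ch)
        return any(lo <= cp <= hi for lo, hi in ranges)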


A./
  
  





Re: ? Reasonable to propose stability policy on numeric type = decimal

2010-07-26 Thread Asmus Freytag

On 7/26/2010 12:13 PM, Mark Davis ☕ wrote:
I agree that having it stated at point of use is useful - and we do 
that in other cases covered by stability clauses; but we can only 
state it IF we have the corresponding stability policy.

Mark,

The statement in your "but" clause really isn't correct.

Writing

   "A character is given/is assigned the X property if ..."

is a type of statement that is made everywhere in the definitions of 
properties. For an example, look no further than chapter 4 ("Pairs of 
opening and closing punctuation are given their General_Category 
values...").


Therefore, the principal difference between my proposed formulation and 
the current text (other than details of phrasing) is the "only if" part. 
The "only if" refers to the fact that Decimal_digit is currently not 
assigned for characters used as decimal digits that are out of order.


Therefore, there's nothing in the proposed language that couldn't be 
stated right now for 6.0.


If you want a stability guarantee on top of that, it's really easy to 
state *after* you've clarified the definition of decimal_digit.


The definition of Decimal_Digit will not change.

*That* would be a proper stability guarantee.

A./

PS: I'm, like John, rather skeptical about adding a formal item to the 
stability policies, but if a majority feels otherwise, I would strongly 
recommend first making a tight definition, and second, freezing that 
definition, rather than repeating the definition in the stability 
policies, where it's hard to follow and out of context.



Proposed text:

   A character is given the decimal digit property if, and only if, it is
   used in a decimal place-value notation and all 10 digits are encoded
   in a single unbroken run starting with the digit of value 0, in
   ascending order of magnitude.






Re: Reasonable to propose stability policy on numeric type = decimal

2010-07-25 Thread Asmus Freytag
The short answer to Karl's question is that there will not be an 
absolute guarantee.


The long answer is that, partly for the reasons he's mentioned, this 
won't be a practical problem.


A. Most of the living scripts that are in wide use have been encoded, 
including whatever digits are in use.
B. People reviewing encoding proposals include programmers, who would 
object to scattering digits.


Thus, the only time this would be an issue is if there were some 
exceptional circumstances. And, as the name says, those circumstances 
could force an exception. If that happens there are two possible 
consequences:


1. The script in question is important enough that everybody will build 
in exceptions into their conversion algorithms
2. The script is so unimportant, that its number system won't be 
supported (i.e. it's treated just like other text).


So, for extending your computer language, there's no reason to hold up 
support for many important scripts, just because of a hypothetical 
future exception.


A./

PS: because I suspect more than one existing implementation is 
offset-based, there would already be tremendous pressure to prevent 
exceptions :)


PPS: a very hypothetical tough case would be a script whose letters 
serve both as letters and as decimal place-value digits, with modern 
living practice. Having a policy like you suggest would officially make 
that unsupportable, but there are other cases, like the language that 
wanted to use the @ sign as a letter, that are de facto unsupportable 
with the modern infrastructure. My suspicion is that users of such a 
script would realize that their method is de facto unsupported (and 
unsupportable) and find some way to change their ways. Changing 
practices in the face of changing technology is something that happens 
all the time, not only to small communities - but that's an entirely new 
subject :)






Re: Reasonable to propose stability policy on numeric type = decimal

2010-07-25 Thread Asmus Freytag

On 7/25/2010 6:05 PM, Martin J. Dürst wrote:



On 2010/07/26 4:37, Asmus Freytag wrote:


PPS: a very hypothetical tough case would be a script where letters
serve both as letters and as decimal place-value digits, and with modern
living practice.


Well, there actually is such a script, namely Han. The digits (一、
二、三、四、五、六、七、八、九、〇) are used both as letters and as 
decimal place-value digits, and they are scattered widely, and of 
course there is a lot of modern living practice.

Martin,

you found the hidden clue and solved it, first prize :)

They do not show up as gc=Nd, nor as numeric types Digit or Decimal.

The situation is worse than you indicate, because the same characters 
are also used as elements in a system that doesn't use place-value, but 
uses special characters to show powers of 10.


However, as I indicated in my original post, in situations like that, 
there are usually some changes in practice that took place. Much of the 
living modern practice in these countries involves ASCII digits. While 
the ideographic numbers are definitely still used in certain contexts, 
I've not seen them in input fields and would frankly doubt that they 
exist there. I would fully expect that they are supported as a number 
format for output, at least in some implementations, and, of course, 
that input methods convert ASCII digits into them. In other words, I 
wonder whether automatic conversion goes only one way for these numbers. 
I would suspect so, for the general case, but I don't actually know for 
sure.


For someone in Karl's situation, it would be interesting to learn 
whether and to what extent he should bother supporting these numbers in 
his language extension.


A./



Re: ? Reasonable to propose stability policy on numeric type = decimal

2010-07-24 Thread Asmus Freytag

On 7/24/2010 3:00 PM, Bill Poser wrote:

On Sat, Jul 24, 2010 at 1:00 PM, Michael Everson ever...@evertype.com wrote:

  

Digits can be scattered randomly about the code space and it wouldn't make any 
difference.



Having written a library for performing conversions between Unicode
strings and numbers, I disagree. While it is not all that hard to deal
with the case in which the characters for the digits are scattered
about the code space, if they occupy a contiguous set of code points
in order of their value as they do, e.g., in ASCII, it simplifies both
the conversion itself and such tasks as identifying the numeral system
of a numeric string and checking the validity of a string as a number
in a particular numeral system.

It may well be that adopting such a policy is not realistic, but there
would be advantages to it if were.
  
Bill,

Michael is no programmer, hence he doesn't have a first-hand 
understanding of why programmers distinguish between character set 
mapping (normally requiring look-up tables) and digit conversion 
(normally done by offset calculations).
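
For illustration (mine): with all ten digits in one contiguous run, 
conversion needs nothing more than the code point of the zero:

    # Offset-based conversion for a contiguous digit run, here the
    # Devanagari digits U+0966..U+096F (zero first, ascending order).
    DEVANAGARI_ZERO = 0x0966

    def digits_to_int(s, zero=DEVANAGARI_ZERO):
        value = 0
        for ch in s:
            d = ord(ch) - zero               # the offset calculation
            if not 0 <= d <= 9:
                raise ValueError("not a digit in this run: %r" % ch)
            value = value * 10 + d
        return value

    print(digits_to_int("\u0967\u0968\u0969"))  # १२३ -> 123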

That said, there are enough programmers on the committees that scattered 
encoding of digits, while not prevented, is at least not the method of choice.

The problem with making this a policy is that some scripts may not have a 
decimal place-value number system (or such use may not be documented) at 
the time of their encoding. That means a digit zero may not be known or 
documented.

However, a prudent encoding policy would be to leave a gap in that case, 
because there have been scripts for which use of a decimal place-value system 
was later discovered.

A./





Re: charset parameter in Google Groups

2010-07-07 Thread Asmus Freytag

Andreas,

I think we all realize your frustration with well-meaning software. 
Because tags can be wrong through no fault of the human originating the 
document, I fully understand that Google might want to attempt to 
improve the user experience in such situations.


The problem is that doing so should not come at the expense of authors 
who correctly tag their documents and whose servers preserve their tags 
and don't mess with them. That your message was broken exposed a bug in 
Google's implementation. And that was acknowledged as well.


I have not seen any design details of the algorithm that Google uses 
(when correctly implemented) so I can't comment on whether it strikes 
the correct balance between honoring tags in the normal case, where 
they should be presumed to be correctly applied, vs. detecting the case 
of clearly erroneous tags and doing something about them so users aren't 
stuck when documents are mis-tagged.


However, in principle, I support the development of tools that can 
handle imperfect input - after all, you as a human reader also don't 
reject language that isn't perfectly spelled or that violates some of 
the grammatical rules.


There's a benefit to these kinds of tools, but, as you keep reminding 
us, there's a cost (which needs to be minimized). This cost is similar 
to that of a spell-checker rejecting a correctly spelled word. Still we 
are better off with them than without them.


For that reason, I think you will find few takers for your somewhat 
absolutist position, whereas you would get more sympathy if you were 
simply reminding everyone of the dangers of poorly implemented solutions 
that can break correctly tagged data.


A./



Re: charset parameter in Google Groups (was Re: Indian Rupee Sign to be chosen today)

2010-06-28 Thread Asmus Freytag

On 6/28/2010 11:38 AM, Mark Davis ☕ wrote:



The problem with slavishly following the charset parameter is that it 
is often incorrect. However, the charset parameter is a signal into 
the character detection module, so if the charset is correctly supplied 
with the message, then the results of the detection will be weighted 
in that direction.


The weighting factor / mechanism may be something that you might look at 
for possible improvement.


Doug raised an interesting argument, i.e. that some values of a charset 
parameter might have a higher probability of being correct than other 
values.


If something is tagged Latin-1 or Windows-1252, the chances are that 
this is merely an unexamined default setting. Most of the other 8859 
values should be much less likely to be such blind defaults.


I wonder whether the probability of successful charset assignment 
increases if you were to give these more specific charset values a 
higher weight.


When I played with simple recognition algorithms about 15 years ago, I 
found that some simple methods for crude language detection gave 
signatures that would allow charset detection. Even though these methods 
weren't sophisticated enough to resolve actual languages (esp. among 
closely related languages) they were good enough to narrow things down 
to the point, where one could pick or confirm charsets.


For example, significant stretches of German can be written without 
diacritics and can fool charset detection unless it picks up on the 
statistical patterns for German. With those in hand, the first non-ASCII 
character encountered is then likely to nail the charset. Or, absent 
such a character, the statistics can be used to confirm that an existing 
charset assignment is plausible. (8859-15, having been deliberately 
designed to be undetectable, is the exception - unless there's a Euro 
sign in the scanned part of the document...)
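
A sketch of the weighting idea (mine; the boost values are invented, and 
a real detector would add statistical language scoring of the kind 
described above):

    # Generic tags like Latin-1 are often unexamined defaults, so a
    # declared charset earns a smaller boost for them.
    TAG_BOOST = {"iso-8859-1": 1.2, "windows-1252": 1.2}
    SPECIFIC_BOOST = 2.0

    def best_charset(data: bytes, declared: str, candidates):
        best, best_score = None, -1.0
        for cs in candidates:
            try:
                data.decode(cs)
                score = 1.0                  # decodes without error
            except UnicodeDecodeError:
                score = 0.0
            if cs == declared.lower():
                score *= TAG_BOOST.get(cs, SPECIFIC_BOOST)
            if score > best_score:
                best, best_score = cs, score
        return best

    print(best_charset("Grüße".encode("iso-8859-2"), "iso-8859-2",
                       ["iso-8859-1", "iso-8859-2"]))  # iso-8859-2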


A./



Re: Latin Script

2010-06-28 Thread Asmus Freytag

I'd like to second Mark.

There is a lot of information in the Standard, including the UAXs and 
the Unicode Character Database, that would help answer your questions.


The volunteers associated with the Unicode effort have worked hard 
putting all that information together - so use it, instead of taking up 
their time in repeating it all in personal answers to you.


A./

On 6/28/2010 9:37 PM, Mark Davis ☕ wrote:
See the following for the (/many/) differences between characters with 
the Latin script, and those with LATIN in their names.


http://unicode.org/cldr/utility/unicodeset.jsp?a=\p{script:latin}&b=\p{name:/LATIN/}


I'd suggest taking a more focused approach to learning about the 
standard, rather than trying relatively scattershot questions to this 
list. You might read through at least the first 3 chapters of the 
Unicode Standard, plus the Scripts UAX. These are all online for free 
at unicode.org.


Mark

— Il meglio è l’inimico del bene —


On Mon, Jun 28, 2010 at 20:55, Tulasi tulas...@gmail.com wrote:


Looks like Unicode did not create any name for any Latin letter/symbol
with LATIN in its name :-')

Am I correct?

Is there a mailing list for ISO/IEC ?

 I don't think it's necessary to post these glyphs to the public
list.

Better to do like Edward Cherlin, i.e., type the symbol after the
name.

e.g., LATIN SMALL LETTER PHI (ɸ)

That way an illiterate like me can quickly see the letter/symbol along
with its name, without additional research.

 The merger between Unicode and ISO 10646 caused a few character
names in
 Unicode to be changed to match the 10646 names.

My I know these letters/symbols with names please?

Tulasi
PS: Thanks Doug, especially for posting the links


From: Doug Ewell d...@ewellic.org
Date: Sun, 27 Jun 2010 16:09:41 -0600
Subject: Re: Latin Script
To: Unicode Mailing List unicode@unicode.org
Cc: Tulasi tulas...@gmail.com

Tulasi tulasird at gmail dot com wrote:

 U+00AA FEMININE ORDINAL INDICATOR (which does not contain
LATIN) is
 considered part of the Latin script, while U+271D LATIN CROSS
(which
 does) is considered common to all scripts.

 Can you post both symbols please, thanks?

I can point you to http://www.unicode.org/charts/PDF/U0080.pdf , which
includes a glyph for U+00AA, and
http://www.unicode.org/charts/PDF/U2700.pdf , which includes a
glyph for
U+271D.  I don't think it's necessary to post these glyphs to the
public
list.

 Trying to know who among ISO and Unicode first created the
names' list
 for Latin-script is not an indication of obsession :-')

So among Unicode and ISO/IEC, who first created ISO/IEC 8859-1 &
ISO/IEC 8859-2 letters/symbols names with each name with LATIN
in it?

Most of the characters in the various parts of ISO 8859 were
originally
standardized before Unicode or ISO 10646, so the names were probably
either created by the ISO/IEC subcommittees responsible for those
parts,
or found in earlier standards and adopted as-is.

The merger between Unicode and ISO 10646 caused a few character
names in
Unicode to be changed to match the 10646 names.

--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ is dot gd slash 2kf0s








Re: Generic Base Letter

2010-06-27 Thread Asmus Freytag
The one argument that I find convincing is that too many implementations 
seem set to disallow generic combination, relying instead on fixed 
tables of known/permissible combinations.


In that situation, a formally adopted character with the clearly stated 
semantic of "is expected to actually render with ANY combining mark from 
ANY script" would have an advantage. List-based implementations would 
then know that this character is expected to be added to the rendering 
tables for all marks of all scripts.


Until and unless that is done, it couldn't be used successfully in those 
environments, but if the proposers could get buy-in from a critical mass 
of vendors of such implementations, this problem could be overcome.


Without such a buy-in, by the way, I would be extremely wary of such a 
proposal, because the table-based nature of these implementations would 
prohibit the use of this new character in the intended way.


A./



Re: Indian Rupee Sign to be chosen today

2010-06-26 Thread Asmus Freytag

On 6/26/2010 5:41 PM, Doug Ewell wrote:



Regarding the inability to distinguish 8859-15 heuristically from 
8859-1, I understand the problem when there are no tags or other 
hints, or for cases like Windows-1252 text declared to be 8859-1, but 
it seems unlikely to me that there is much text encoded in 8859-1 (or 
Windows-1252) that is tagged as 8859-15.  I would think in a case like 
that, it might make sense to trust the tag.  I suspect the problem of 
unreliable declarations is greater for most other tuples of 
(declared-encoding, actual-encoding).

Doug,

this is an interesting concept, i.e. that the reliability of the tag 
being correct might well depend on the value of the tag. I wonder 
whether that type of probability is considered at all when making 
the decision to trust auto-recognition over the tag value.


A./




Re: Latin Script

2010-06-17 Thread Asmus Freytag

On 6/17/2010 7:24 PM, Tulasi wrote:
What is equivalent ISO/IEC 

ISO/IEC what?

There are hundreds of ISO/IEC standards, of which dozens are character 
encoding standards.

for U+0278 LATIN SMALL LETTER PHI (ɸ)?
Or do Unicode & ISO/IEC use different number & name for same letter/symbol?
  


ISO/IEC 10646 uses the same number and name as Unicode for this.

A./
  





Re: Writing a proposal for an unusual script: SignWriting

2010-06-14 Thread Asmus Freytag

On 6/14/2010 1:18 PM, Mark E. Shoulson wrote:

On 06/14/2010 02:15 PM, Asmus Freytag wrote:

On 6/14/2010 9:21 AM, Stephen Slevinski wrote:


Plain text SignWriting should be able to write actual sign language,
such as hello world.

You could equally well insist that it should be possible to express the
opening bar of "twinkle, twinkle little star" in plain text, or to write
"the square root of the inverse of a plus b" in plain text.

In both cases, you would be disappointed and find that a markup language
is required, such as MathML - although, specifically for math, it is
possible to devise an extremely lightweight markup language that comes
close to plain text.


It is all too tempting and too easy for discussions of Why X Should 
be Encoded in Unicode to devolve into Why X is So Incredibly 
Useful.  In this case, I don't think that's the point.  

Correct, we were not discussing that question.
Unlike some other proposals, I think it is clear (to me, anyway) that 
SignWriting has a fairly solid user-base and also an important use 
(transcribing signed languages, which don't really have too many other 
ways of being transcribed. Things like HamNoSys are also not encoded 
yet).
Mark (Davis) raised the good point that this needs to be substantiated - 
for now, for the purposes of this discussion, I take the above as a given.
Here, the question is more a matter of given that SignWriting is 
nifty, does it qualify as plain text?  


That is the central question.

Or even Does the way SignWriting does its thing map well to the way 
Unicode does things?  


I tried to explain that these are nearly equivalent. A practical 
definition of plain text could be: text encoded as a stream of Unicode 
characters, with no other information. However, there are other 
definitions of plain text, based on the ideal concept of the thing, and 
the two don't overlap 100%. Both are useful.


If it does not (and cannot be made to do so), then no matter how 
useful SignWriting is, it may simply not be encodable.  It's not 
because it doesn't deserve to be, and yes, that would really be a 
bummer because it would relegate signed languages to second-class, but 
Unicode has its limitations, and SignWriting may well be beyond its 
capabilities.


That's where my insistent questions about a layered system come in. One 
where the elements (symbols) are encoded in Unicode, but where some or 
all the details of their relation is encoded in a higher level protocol.


I suspect that the XML attempts that exist do not implement a correct 
layering; that is, they probably encode the identity of the symbols not 
as character codes but as named entities. That would explain why Steve 
said "same data, only more complex".


(That said, I find myself thinking that it *should* be possible to 
align Unicode and SignWriting.  But I recognize that it might not be.)
As long as the position of the proponents is that all fine details of 
formatting and layout must be carried in the character encoding level, 
I'm not hopeful.




Not all streams of concrete small integers are ipso facto plain text,
even though you can map those integers to the private use space.


I guess you would need to establish a distinct and independent meaning 
for each code point, which would have to be something more specific 
than "...and then you give the x-coordinate".
Generic placement operators I could possibly fathom, since they serve to 
linearize the text - an analogy would be the Ideographic Description 
Symbols that allow description of a two dimensional layout. But the IDS 
stop short of trying to express the subtle modifications that arise out 
of the context and placement of the elements in the final ideograph. For 
that you have to turn to another source, in this case a font.



For the future, I am considering a browser plugin that will detect and
render SignWriting character data. A regular expression could scrape
the appropriate PUA characters. Another regular expression could
validate that the characters represent valid structures. Then the
SignWriting display could be built using individual symbols, completed
signs, or entire columns.


In other words, a layout engine.


Is there such a thing as SignWriting without a layout engine?  I guess 
the same question could be asked about musical notation (though I 
think it probably could have been coded as plain text; see also 
http://abcnotation.com/ for a very powerful musical notation using 
only ASCII, but decidedly *not* plain-text in nature).

The point is, because one already requires a layout engine (or browser 
plug-in), one might as well use something like MathML in conjunction 
with standard character codes for the basic symbols.



If SignWriting cannot be successfully used except with 2 fonts, then I
see little need for standardizing the code. What you describe is a
private use scheme, even though the private group may have many members.


I'm not sure I agree with this.  Just because only two fonts are out

Re: Tamil u,uu matra consonants - Orthographic variation

2010-06-09 Thread Asmus Freytag

Can we stop double posting on Unicode and Unicore list?

People on the unicode list cannot reply to people on the other list,
and vice versa (unless they happen to be members of both lists).

Thanks.

A./



Re: Questionable lines on LineBreakTest.txt

2010-06-07 Thread Asmus Freytag

On 6/7/2010 4:26 PM, Masaaki Shibata wrote:

I'm studying UAX #14 (5.2.0) and testing my code against
LineBreakTest.txt, and I found that some test cases in this file seem
to contradict the rules in the document.

For example, LB25 explicitly prohibits breaking between CP and PO,
while LineBreakTest.txt says ÷ [0.2] RIGHT PARENTHESIS (CP) ÷ [999.0]
PERCENT SIGN (PO) ÷ [0.3] (l. 1137).

I'm not a Unicode expert; which rules lead to a result like this?
Did I miss any important descriptions in the document?
  

Probably not. The test file has been known to be wrong before.

The spec clearly states that breaks are only allowed if there are spaces,
as in:

CP SP+ ÷ OP

So this line in the test file appears incorrect.
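
For anyone doing this kind of testing, by the way, the per-line format
of the file is easy to consume; a minimal sketch in Python:

  # Minimal sketch for consuming LineBreakTest.txt lines, which
  # alternate break marks and code points: "÷ 0029 ÷ 0025 ÷"
  # (÷ U+00F7 = break allowed, × U+00D7 = break prohibited),
  # followed by an optional "#" comment.
  def parse_test_line(line):
      data = line.split('#', 1)[0].strip()   # drop the trailing comment
      if not data:
          return None
      tokens = data.split()
      breaks = [t == '÷' for t in tokens[0::2]]
      chars = ''.join(chr(int(t, 16)) for t in tokens[1::2])
      return chars, breaks

  chars, breaks = parse_test_line('÷ 0029 ÷ 0025 ÷')
  assert chars == ')%' and breaks == [True, True, True]
  # Compare breaks[1:-1] against your implementation's break
  # opportunities between the characters of chars.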

A./


Re: Least used parts of BMP.

2010-06-04 Thread Asmus Freytag

On 6/4/2010 8:34 AM, Mark Davis ☕ wrote:
In a compression format, that doesn't matter; you can't expect random 
access, nor many of the other features of UTF-8.


The minimal expectation for these kinds of simple compression is that 
when you write a string with a particular /write/ method, and then 
read it back with the corresponding /read/ method, you get exactly the 
original string contents back, and you consume exactly as many bytes 
as you had written. There are really no other guarantees.
Actually, SCSU makes an additional guarantee, which is that you can edit 
the compressed string. In other words, you can insert a substring such 
that the new string remains a valid compressed string and the parts 
preceding and following the insertion, when read, match the 
corresponding portion of the original after decoding. I remember that 
this was an important design criterion for the precursor RCSU.  Their 
implementation required the ability to deliver a patch to a compressed 
string, something that isn't possible with many other compression formats.
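
A toy demonstration of that editability, restricted to the simplest
case: in SCSU's initial state (single-byte mode, default window at
U+0080), Latin-1 text that avoids the tag bytes is its own encoding, so
splicing such bytes into such a stream yields another valid stream. (A
real patch would, of course, have to re-establish whatever window state
is active at the patch point.)

  # Toy demonstration. In SCSU's initial state, bytes 0x20..0x7F are
  # literal ASCII, 0x80..0xFF map onto U+0080..U+00FF (dynamic window 0),
  # and 0x00/0x09/0x0A/0x0D pass through; so Latin-1 text avoiding the
  # other C0 bytes is its own SCSU encoding.
  def scsu_for_latin1(s):
      data = s.encode('latin-1')
      assert all(b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D) for b in data)
      return data

  original = scsu_for_latin1('über alles')
  patch = scsu_for_latin1('haupt')

  # Splicing the patch in leaves a valid stream, and the untouched parts
  # still decode to the corresponding parts of the original:
  patched = original[:4] + patch + original[4:]
  assert patched.decode('latin-1') == 'überhaupt alles'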


So there is a sliding scale in features, each compression method being 
designed to address the specific requirements of a given application.


A./


Mark

— Il meglio è l’inimico del bene —


On Fri, Jun 4, 2010 at 06:35, Otto Stolz otto.st...@uni-konstanz.de 
mailto:otto.st...@uni-konstanz.de wrote:


Hello,

On 2010-06-03 07:07, Kannan Goundan wrote:

This is currently what I do (I was referring to this as the "compact
UTF-8-like encoding").  The one difference is that I put all the
marker bits in the first byte (instead of in the high bit of every
byte):
  0xxx
  10xx xyyy
  110x xxyy yzzz


The problem with this encoding is that the trailing bytes
are not clearly marked: they may start with any of
'0', '10', or '110'; only '111' would mark a byte
unambiguously as a trailing one.

In contrast, in UTF-8 every single byte carries a marker
that unambiguously marks it as either a single ASCII byte,
a starting, or a continuation byte; hence you have not to
go back to the beginning of the whole data stream to recognize,
and decode, a group of bytes.
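
The practical payoff of that per-byte marking is local
resynchronization; a minimal sketch, assuming well-formed UTF-8:

  # Every UTF-8 byte declares its own role, so a reader dropped at an
  # arbitrary offset can resynchronize locally (given well-formed input):
  def next_boundary(data, i):
      """Index of the first non-continuation byte at or after offset i."""
      while i < len(data) and (data[i] & 0xC0) == 0x80:  # 10xxxxxx = trailing
          i += 1
      return i

  utf8 = 'Grüße'.encode('utf-8')
  assert next_boundary(utf8, 3) == 4   # offset 3 is mid-character; skip to ß
  # In the scheme above, a byte such as 0x41 could be either a first
  # byte or trailing payload, so no such local test exists.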

Best wishes,
 Otto Stolz









Re: Greek letter LAMDA?

2010-06-02 Thread Asmus Freytag

On 6/1/2010 6:04 PM, Mark Crispin wrote:

I don't think that the unicode list should be used for the type of
questions that have polluted it recently.

That list unicode@unicode.org is open for general questions.
It has no formal standing as far as the business of the Consortium
is concerned, and many core UTC members are NOT on this
list, because it attracts general questions etc.

A./

PS: and if you've forgotten, one does need to subscribe to the list
in order to post, so it already fits your definition of members-only.




Re: Least used parts of BMP.

2010-06-02 Thread Asmus Freytag

On 6/1/2010 8:04 PM, Kannan Goundan wrote:

I'm trying to come up with a compact encoding for Unicode strings for
data serialization purposes.  The goals are fast read/write and small
size.
  

Why not use SCSU?

You get the small size and the encoder/decoder aren't that complicated.

You get the additional advantage that, many years in the future, if 
data serialized with your scheme are found on an old hard disk, someone 
has a chance to read them, because SCSU is well documented (see 
UTS #6).


A./





Re: Greek letter LAMDA?

2010-06-02 Thread Asmus Freytag




On 6/2/2010 11:46 AM, Jonathan Rosenne wrote:

  Although this mail was not addressed to me, I did read it. Sue me.
  

The terms of use for the Unicode mail list essentially state that these
types of boilerplate are null and void as far as Unicode is concerned.
You will find the following at
http://www.unicode.org/policies/mail_policy.html


  Disclaimer
E-mail submitted to any of our e-mail lists which contains disclaimers
of confidentiality or reservation of copyright, or similar, will be
treated as if these disclaimers were not present, and neither the
Consortium nor the users of our e-mail lists shall be liable for their
use of the information in the e-mail under this policy. It is up to the
submitter to ensure that no confidential or otherwise restricted
information is sent to any e-mail list.


As you can see, they have no grounds to sue you. :)

A./

  
Jony

  
  
-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On
Behalf Of John Dlugosz
Sent: Wednesday, June 02, 2010 5:03 PM
Cc: unicode@unicode.org
Subject: RE: Greek letter "LAMDA"?



  Robert Abel noted:

Note that as of 1993, the only "LAMDA" or "LAMBDA" characters
in the standard were:

039B;GREEK CAPITAL LETTER LAMDA;Lu;0;L;N;GREEK CAPITAL LETTER LAMBDA;;;03BB;
03BB;GREEK SMALL LETTER LAMDA;Ll;0;L;N;GREEK SMALL LETTER LAMBDA;;039B;;039B
019B;LATIN SMALL LETTER LAMBDA WITH STROKE;Ll;0;L;N;LATIN SMALL LETTER BARRED LAMBDA

  

So why was 019B spelled differently than the other two, originally?


TradeStation Group, Inc. is a publicly-traded holding company (NASDAQ
GS: TRAD) of three operating subsidiaries, TradeStation Securities,
Inc. (Member NYSE, FINRA, SIPC and NFA), TradeStation Technologies,
Inc., a trading software and subscription company, and TradeStation
Europe Limited, a United Kingdom, FSA-authorized introducing brokerage
firm. None of these companies provides trading or investment advice,
recommendations or endorsements of any kind. The information
transmitted is intended only for the person or entity to which it is
addressed and may contain confidential and/or privileged material. Any
review, retransmission, dissemination or other use of, or taking of any
action in reliance upon, this information by persons or entities other
than the intended recipient is prohibited. If you received this in
error, please contact the sender and delete the material from any
computer.


Re: Greek letter LAMDA?

2010-06-02 Thread Asmus Freytag

On 6/2/2010 3:28 PM, John Dlugosz wrote:


If anyone can “null and void” it, I wonder why companies bother to put 
such things in people’s outgoing mail. I would have thought they could 
come up with a proper net-etiquette version, but they just don’t care.


These things are bogus, because they get appended automatically to all 
messages leaving certain mailers, independent of the nature of the 
message. I wouldn't be surprised if they are hard to enforce, but I'm 
not a lawyer.


The Unicode list can certainly set its own conditions for participation, 
and because you have to sign up, I'd rate the chance that Unicode can 
enforce its rules on participants rather high.


Therefore, anyone sending messages with funny legal mumbo-jumbo is put 
on notice beforehand that it will not be respected. If they go ahead and 
send it anyway, that's their choice, but they'd have a tough time 
arguing that they could have a reasonable expectation that it would be 
honored. So, I think, in the case of a mail list like this, you can 
actually get away with declaring these things null & void.


Cheers,

A./

PS: we should stop with this topic, because that's not what this list is 
for.







Re: Least used parts of BMP.

2010-06-02 Thread Asmus Freytag
SCSU is a pass-through for ASCII, plus it handles the common mix of 
ASCII plus 96 local characters (Latin-1, Greek, Cyrillic, Thai, etc.) 
really fast. Go look at the sample code. If you take that as a starting 
point for optimization, I think you'll be fine.
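
That fast path fits in a few lines. A stripped-down sketch that handles
only the initial state (single-byte mode, default window at Latin-1)
and defers everything else to a full decoder:

  # Stripped-down sketch of the SCSU fast path: initial state only
  # (single-byte mode, dynamic window 0 at U+0080). Anything else is a
  # tag byte and needs the full decoder described in UTS #6.
  PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D}
  WINDOW_BASE = 0x0080

  def decode_fast_path(data):
      out = []
      for b in data:
          if b >= 0x80:
              out.append(chr(WINDOW_BASE + (b - 0x80)))   # active window
          elif b >= 0x20 or b in PASS_THROUGH:
              out.append(chr(b))                          # ASCII pass-through
          else:
              raise NotImplementedError('SCSU tag byte; full decoder needed')
      return ''.join(out)

  assert decode_fast_path(b'caf\xe9') == 'café'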






Re: Greek letter LAMDA?

2010-06-01 Thread Asmus Freytag

On 6/1/2010 1:37 PM, John Dlugosz wrote:


Why does the code chart call the plain Greek letter (upper and lower 
case) “LAMDA” rather than “LAMBDA”? The latter is used in other places 
where a glyph is based on the lambda, e.g. “U+019B LATIN SMALL LETTER 
LAMBDA WITH STROKE”


Names sometimes don't use the best spellings, but because names are 
immutable once published, spelling issues discovered after the first 
encoding can't be fixed. Make sure you don't use the Standard as a 
spelling reference.
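
A quick check in Python, for example, shows the frozen names as
published:

  import unicodedata

  # The names are immutable, chosen spellings and all:
  print(unicodedata.name('\u039b'))   # GREEK CAPITAL LETTER LAMDA
  print(unicodedata.name('\u019b'))   # LATIN SMALL LETTER LAMBDA WITH STROKE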


A./



Re: Greek letter LAMDA?

2010-06-01 Thread Asmus Freytag

On 6/1/2010 4:14 PM, Mark Crispin wrote:
Is it really necessary to have this sort of pedagogical discussion on
the Unicode list?
Is this character name misspelled?
Is Unicode a for-profit company?
Who owns the Unicode font?

etc. etc.

Perhaps we need to have a unicode-qu...@unicode.org for novices to ask
questions, and make this list be member-only?

There is a member-only list for in-depth technical discussions.
Are you a member?

A./




Re: Unicode Inc

2010-05-31 Thread Asmus Freytag

On 5/31/2010 12:33 PM, Tulasi wrote:

Thanks Mark for posting the links!
My posting was based on
http://www.unicode.org/consortium/directors.html
where at the bottom it says Unicode Inc.

It looks like the elected members from the consortium
http://www.unicode.org/consortium/consort.html
form Unicode Inc.

Am I correct?
  

Not really.

The members of the consortium are other organizations, usually
corporations. Each organization is represented by people (delegates).
The delegates are not members of the consortium, but merely people
who represent each member organization. Each organization normally
gets one vote, even though most send two delegates.

In this link http://www.unicode.org/consortium/consort.html
it looks like there are fewer than 100 members in the consortium.
How many members do you currently have who can vote in elections?

The link http://www.unicode.org/consortium/consort.html
says the Unicode consortium is a non-profit organization.

Recently I purchased Windows 7 from Microsoft Corporation.
This product has Unicode fonts for a number of languages.

But Microsoft Corporation is for profit.
So it looks like Unicode Inc is for profit through its elected officials,
but the Unicode consortium is non-profit.

Am I correct?
  

No.

Unicode, Inc. (The Unicode Consortium) is a non-profit organization.
That means it must meet certain legal requirements and restrictions
in how it is funded and operated. The same requirements do not apply
to its membership. Specifically, both for-profit and not-for-profit
organizations may be members of the Consortium. There is nothing
unusual about the fact that the for-profit status of the members is
unrelated to the non-profit nature of the Consortium. That's
essentially the case for all non-profit organizations.


I still do not understand:
What is the role of this Director exactly?
  

I think you are asking very basic questions about how a non-profit
corporation is organized.

Rather than continuing this discussion at great detail here on a list
that is intended for character encoding questions, you might start
by reading up on basic background, for example in the Wikipedia:

http://en.wikipedia.org/wiki/Non-profit_organization

and

http://en.wikipedia.org/wiki/Board_of_Directors

A./

Respectfully,
Tulasi


From: Mark Davis m...@macchiato.com
Date: Fri, 28 May 2010 09:14:00 -0700
Subject: Re: Unicode Inc
To: Tulasi tulas...@gmail.com
Cc: Unicode Discussion unicode@unicode.org

See http://www.unicode.org/consortium/consort.html. The consortium is
constituted according to its bylaws:
http://unicode.org/consortium/unicode-bylaws.html

Roughly, it is constituted by its membership:
http://www.unicode.org/consortium/memblogo.html, which elects the directors
yearly. The officers report to the directors, and are responsible for the
running of the consortium. The technical work is delegated to the technical
committees, which operate according to their procedures.

The background of the officers and directors can be found on
http://www.unicode.org/consortium/directors.html. For a historical view,
see http://www.unicode.org/history/boardmembers.html.

Mark

On Thu, May 27, 2010 at 17:32, Tulasi tulas...@gmail.com wrote:

I am new to this group.
I am browsing
http://www.unicode.org/consortium/directors.html
It looks like Unicode Inc is formed by
Google, Inc.
Microsoft Corporation
IBM Corporation
Apple

Have I understood correctly?

Also it looks like Unicode Incorporated has one director to represent
the whole of Asia, while Asia has more languages than any other continent.

What is the role of this Director exactly?

Respectfully,
Tulasi


  




Re: IS UNICODE a STANDRAD ?

2010-05-31 Thread Asmus Freytag

On 5/31/2010 2:12 PM, V. M. Kumaraswamy wrote:

Hello all,
 
Just a clarification on UNICODE.
 
Is UNICODE a STANDARD

Yes, Unicode (The Unicode Standard) is indeed a standard.

And no, the use of ALL CAPS is discouraged. The
proper spelling is Unicode.

that needs to be followed by all COUNTRIES?

There's no requirement for anyone to be conformant
to the Unicode Standard. However, if you decide to
claim conformance, there are specific requirements that
you must meet, and they are defined in the Standard.
 
Is UNICODE a CONSORTIUM to make certain
guidelines that need to be followed for CERTAIN CHARACTERISTICS?

Yes, Unicode (The Unicode Consortium) is indeed a consortium.
If you just use Unicode as a shorthand, you need to rely on the
context of your communication to allow readers to understand
whether you mean the Standard or the Consortium.

The Unicode Consortium is the publisher of The Unicode
Standard as well as several other technical standards.

As with the Unicode Standard, there is no requirement
that you support these standards. But if you decide to
claim conformance to any of them, there are specific
requirements that you must meet.

Hope this makes the situation more clear.
A./
 
This is just to get some input from all of you.
 
Thanks

Sincerely
 
V. M. Kumaraswamy




