Re: UTF-8 ill-formed question

2012-12-11 Thread vanisaac
From: James Lin 
> Hi
> Does anyone know why ill-formed sequences occur in UTF-8? Besides not 
> following the pattern of UTF-8 byte sequences, I am just wondering how or why?
> If I have a code point, U+4E8C or "二":
> in UTF-8 it's "E4 BA 8C", while in UTF-16 it's "4E8C". Where does this "BA" 
> come from?
> 
> thanks
> -James 

Each of the UTF encodings represents the binary data in different ways. So we 
need to break the scalar value, U+4E8C, into its binary representation before 
we proceed.

4E8C -> 0100 1110 1000 1100

Then, we need to look up the rules for UTF-8. They state that code points 
between U+0800 and U+FFFF are encoded with three bytes, in the form 
1110xxxx 10xxxxxx 10xxxxxx. So, splitting our sixteen bits into groups of 
4 + 6 + 6 and plugging them into the template, we get

  4E8C -> 0100 / 111010 / 001100

  1110xxxx 10xxxxxx 10xxxxxx
+     0100   111010   001100
= 11100100 10111010 10001100

or E4 BA 8C
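
The same arithmetic as a minimal Python sketch (the helper name is mine, for 
illustration; Python's own encoder provides the cross-check):

def utf8_3byte(cp):
    # Three-byte UTF-8 covers U+0800 through U+FFFF.
    assert 0x0800 <= cp <= 0xFFFF
    b1 = 0xE0 | (cp >> 12)           # 1110xxxx: top 4 bits
    b2 = 0x80 | ((cp >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    b3 = 0x80 | (cp & 0x3F)          # 10xxxxxx: low 6 bits
    return bytes([b1, b2, b3])

print(utf8_3byte(0x4E8C).hex())    # e4ba8c
print('二'.encode('utf-8').hex())  # e4ba8c - the library agrees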

-Van Anderson




Re: The rules of encoding (from Re: Missing geometric shapes)

2012-11-12 Thread vanisaac
William, I think you have an unreasonable idea of what a standard actually is. 
You have already made a standard and published it - I've seen all the posts at 
the FCP forum. All you have to do is let people use it. If a user community is 
going to exchange data, they will do so, and it just plain doesn't matter if 
some other user community were to exchange completely different data 
coincidentally using the same sequence of bytes.

The problem is that you don't want a /standard/ - you already have one. You 
want a legitimacy for your ideas that they haven't earned, and you are trying 
to borrow that legitimacy from Unicode and ISO. What you don't understand is 
that the legitimacy you want to borrow is intimately tied in with the fact that 
Unicode has policies and procedures that they follow, one of which is they do 
not recognize scripts that haven't met the criteria for inclusion.

From: William_J_G Overington 

> A feature of using the Private Use Area is that code point allocations are 
> made by a person or entity that is not a standards organization. Also, 
> Private Use Area code point assignments are not unique.

Which has not kept other PUA standards like MUFI and CSUR from successfully 
exchanging data. In fact, they both have successfully demonstrated usage to the 
point that scripts have then been allocated for public use.

> In many cases, neither of those features presents a problem for successful 
> use of a Private Use Area encoding.

> However, although one can often not be concerned with the fact that the code 
> point assignment is not unique, the fact that it is not made by a standards 
> organization is a big problem if one is seeking to have a system that one has 
> invented taken up by people and companies generally.

In other words, you want legitimacy that the idea has not earned.

> For one of my present uses of the Private Use Area I am seeking to have a 
> system that I have invented taken up by people and companies generally.

Then publish the standard and let them do it. If the idea is useful, then 
others will adopt it; if not, they won't.

> However, I feel that there is no chance of a system that I have invented 
> being taken up by people and companies generally using a Private Use Area 
> encoding. Thus, I feel that I will not be able to present an encoding 
> proposal document showing existing widespread usage.

This /feeling/ is specifically contradicted by the evidence of language 
communities adopting the MUFI and CSUR standards.

> However, if the Unicode Technical Committee and the ISO Committee were to 
> agree to the principle of encoding my inventions in plane 13, not necessarily 
> using the particular items or symbols that I am at present using in my 
> research, yet the committees working out how to form a committee or 
> subcommittee to work out what to encode, then I feel that a group project 
> with lots of people contributing ideas could produce a wonderful system 
> encoded into plane 13 that could be of great usefulness to many people.

If it is so wonderful and useful, there is no reason why you wouldn't be able 
to bring together a group of people to develop the standard in Plane 15 just as 
easily. If you can't do that, it's a pretty good indication that it's not as 
useful as you think it is.

> My present goal is to have the opportunity to write a document requesting 
> that agreement in principle and for the document to be considered and 
> discussed by the committees and a formal decision made.

The formal decision will be "no", because you have shown zero actual usage.

> William Overington
 
> 12 November 2012 

-Van Anderson




texteditors that can process and save in different encodings

2012-10-04 Thread vanisaac
From: Stephan Stiller 
> Dear all,
> 
> In your experience, what are the best (plaintext) texteditors or word
> processors for Linux / Mac OS X / Windows that have the ability to *save* in
> many different encodings?
> 
>  This question is more specific than asking which editors have the best
> knowledge of conversion tables for codepages (incl their different
> versions), which I'm interested in as well. There are a number of programs
> that appear to be able to *read* many different encodings – though I prefer
> the type that actually tells me about where format errors are when a file
> is loaded. Then, many editors that claim to be able to read all those
> encodings cannot *display* them; as for that, I don't care about font
> choice and the aesthetics of display, as I'm only interested in plaintext.
> 
> Some things I have seen that are no good:
> 
>- the editor not telling me about the encoding and line breaks it has
>detected and not letting me choose
>- the editor displaying a BOM in hex mode even if there is none (a
>version of UltraEdit I worked with at some point)
> 
>  Stephan 

I've never fully explored the vagaries of its code page detection, but I use 
Notepad++ (notepad-plus-plus.org) for its ability to explicitly choose Unix vs 
Windows line breaks and BOM vs BOM-less output, and for its support of over 50 
code pages and encoding forms. 
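
If you ever need to do the same conversions in bulk, outside an editor, the 
same knobs are available from a script. A minimal Python sketch (the file 
names and codec choices are placeholders; 'utf-8-sig' is the codec that writes 
a UTF-8 BOM):

# Read with an explicit encoding, reporting where any format error sits.
try:
    with open('input.txt', encoding='cp1252') as f:
        text = f.read()
except UnicodeDecodeError as e:
    print(f'bad byte {e.object[e.start]:#04x} at offset {e.start}')
    raise

# Write with an explicit encoding, BOM, and line-break convention.
with open('output.txt', 'w', encoding='utf-8-sig', newline='\r\n') as f:
    f.write(text)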

Van Anderson




Re: Unicode Core

2012-06-22 Thread vanisaac
From: Ken Whistler 
> To echo what Michael said here, the editors are looking into this.
> 
> We did, in fact, do the work to volumize the entire set of charts, including
> all of CJK, for POD, and even made volume covers and title pages.
> However, it turned out that Lulu had production issues for at least some
> of those volumes. So at the last minute we had to limit the POD to
> just the core specification, which didn't cause printing problems.
> 
> It was an interesting experiment, and we learned some lessons from it.
> But we simply do not have the bandwidth to finish wrestling with it for
> Unicode 6.1 right now. (The Unicode 6.2 beta is underway, and the people
> involved with charts need to focus on getting Unicode 6.2 charts prepared.)
> 
> I anticipate that once Unicode 6.2 is done, the editors may take another
> crack at this, and manage to create volumes for charts with settings that
> won't make Lulu production printers crash and burn. But all in good time.
> 
> --Ken 

Wait a minute. Isn't 6.2 just adding the Turkish Lira? Does that really take 
the chart people more than about 10 minutes?

-Van

PS, interesting that you had production issues on doing the code charts as 
print-on-demand. I guess that's not quite as straightforward a process as you 
would think.




Re: Unicode Core

2012-06-21 Thread vanisaac
From: Michael Everson 
> On 21 Jun 2012, at 09:47, Raymond Mercier wrote:
> 
> > While I am very glad to have this, I really do wonder why there was not a 
> > full publication of Unicode 6 or 6.1 from the corporation itself, with all 
> > the charts, as we have had with Unicode 1 to 5. Surely there is a market 
> > for this?
> 
> Perhaps less than us character mavens would imagine. Books don't publish 
> themselves, and publishing takes resources of various kinds.
> 
> But I understand that the Powers That Be are looking into the matter.
> 
> Michael Everson * http://www.evertype.com/ 

Not to mention, it would be freaking HUGE! The Core Specification is published 
at 600+ pages, the code charts are another 2000+, and all the UAXes, 
radical-stroke indices, etc. would push it to well over 3,000 pages. Even if 
you didn't list the CJK code charts, you are still looking at a good 1500-2000 
pages. Not that I don't have a vested interest in getting this to happen for 
future versions, but publishing Unicode in its entirety is an undertaking that 
gets orders of magnitude more difficult to accomplish each year.

Van




complex rendering (was: Re: Mandombe)

2012-06-11 Thread vanisaac
From: Szelp, A. Sz. 
> On Mon, Jun 11, 2012 at 10:58 AM, Stephan Stiller 
> wrote: 
> 
> > 
> > This is interesting only if the encodable elements would be different -
> > remember, Unicode is not a font standard.
> >
> > +1; rendering can be so much more complex than encoding. I'd really like
> > to see a successful renderer for Nastaliq, (vertical) Mongolian, or
> > Duployan. (What *are* the hardest writing systems to render?)
> >
> >
> Vertical mongolian does not seem to be harder to render _conceptually_
> than, let's say, simple arabic. It's more the architectural limitations of
> rendering engines that seem to limit its availability, and the intermixing
> with horizontal text. For Nastaliq, Thomas Milo's DecoType is miraculous:
> it's hard, but given the good job they did, obviously not impossible. —
> Well, I don't know about Duployan.
> 
> /Sz 

I guess this is my invitation to chime in. I'm close to releasing a beta of a 
Graphite engine for (Chinook repertoire) Duployan, using a PUA encoding. By the 
release of 6.3/7.0, we should have a working implementation of Unicode 
Duployan/shorthand rendering for Graphite-enabled applications. Like a Nastaliq 
implementation, it's convoluted and involved, but not impossible. It will not, 
however, be nearly as beautiful as DecoType; I'm not a designer at heart, and a 
Duployan implementation as stunning as Milo's Nastaliq will require the skills 
of people several orders of magnitude more talented than I.

-Van Anderson




Re: U+02D0 and U+01C3

2012-05-21 Thread vanisaac
From: Chigurupati, Nagesh 
> 
> Hello Unicode Folks,
> 
> U+01C3 looks like an 'Exclamation Mark' and it is categorized as 'Letter 
> Other'.
> U+02D0 looks like two inverted triangles and it is categorized as 'Letter 
> Modifier'.
> 
> These code points being categorized as they are would not prevent them in an 
> IDN. However, similar looking code points which are categorized as symbols or 
> punctuation marks (U+0021 and U+003A) would be disallowed in an IDN (IDNA 
> 2008).
> 
> My question is:
> 
> Should the code points be categorized differently so that they are disallowed 
> by IDNA2008 RFCs, or
> Should the individual end user applications prevent such kind of characters 
> in 
> an IDN.
> 
> Thank you.
> 
> Regards,
> Nagesh 

These characters are used as letters in different orthographies. OK, U+02D0 is 
only IPA, AFAIK, so it's an edge case. Nevertheless, the whole idea behind IDNs 
is that they allow domain names in languages other than English, so eliminating 
characters necessary for writing a particular language, simply based on their 
/looking/ like disallowed punctuation characters, would be counterproductive to 
the entire point of IDNs. Does it present challenges? Sure. But Unicode was 
developed to meet just those kinds of challenges.
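
The distinction IDNA2008 keys off is just the general category, which is easy 
to check; a quick sketch with Python's built-in unicodedata module:

import unicodedata

for ch in '\u01C3\u02D0\u0021\u003A':
    print(f'U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}')

# U+01C3 LATIN LETTER RETROFLEX CLICK: Lo     (a letter, so allowed)
# U+02D0 MODIFIER LETTER TRIANGULAR COLON: Lm
# U+0021 EXCLAMATION MARK: Po                 (punctuation, so disallowed)
# U+003A COLON: Po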

-Van




Re: Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-17 Thread vanisaac
From: Mark Davis ☕ 
> On Wed, May 16, 2012 at 9:20 PM,  wrote:
>> From: Ken Whistler 
>> > Orthographies which mix in random characters from other scripts do not
>> > (or should not) drive the identity of characters for *scripts* per se.
>> > And edge cases for making mixed script collation work should not drive
>> > such decisions, either.
>> >
>> > --Ken
>>
>> Anyway, that's what ScriptExtensions.txt is for.
>>
>> -Van
> 
> No, it's not.
> 
> Including x in Lao for some pedagogical (I'm guessing) purpose is
> completely out of scope. That'd be like including π in Latin because it
> sometimes occurs in the middle of English text.
> 
> Mark 

Well, I was speaking of the general case, not this specific example. 
Orthographies which mix in random characters from other scripts do not, and 
should not, drive the identity of characters for scripts, per se. If you need 
to indicate a random character from another script used in a particular 
orthography, Script Extensions is the mechanism that should probably be used, 
rather than assigning a character that firmly belongs in one script to 
script=common.
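
For reference, ScriptExtensions.txt is a plain semicolon-delimited data file - 
each line maps a code point or range to the set of script codes whose 
orthographies use it. A minimal parsing sketch, assuming the file's usual 
"range ; script codes # comment" layout:

def load_script_extensions(path):
    # Map each listed code point to its Script_Extensions script codes.
    extensions = {}
    with open(path, encoding='utf-8') as f:
        for line in f:
            line = line.split('#', 1)[0].strip()  # drop comments
            if not line:
                continue
            cps, scripts = (part.strip() for part in line.split(';'))
            first, _, last = cps.partition('..')
            for cp in range(int(first, 16), int(last or first, 16) + 1):
                extensions[cp] = scripts.split()
    return extensions

# e.g. load_script_extensions('ScriptExtensions.txt').get(0x0951)
# -> a list like ['Beng', 'Deva', ...], depending on the UCD version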

Is that better, Mark?

-Van




Fw: Re: Compliant Tailoring of Normalisation for the Unicode Collation Algorithm

2012-05-16 Thread vanisaac
From: Ken Whistler 
> On 5/16/2012 2:54 PM, Richard Wordingham wrote:
> > I have been wondering if U+0078 LATIN
> > SMALL LETTER X should be made common script because of its use for
> > displaying Lao vowels, but perhaps the principle of separation of
> > scripts should lead to LAO LETTER SMALL X.
>
> Please, no! ;-)
>
> Orthographies which mix in random characters from other scripts do not
> (or should not) drive the identity of characters for *scripts* per se.
> And edge cases for making mixed script collation work should not drive
> such decisions, either.
>
> --Ken

Anyway, that's what ScriptExtensions.txt is for.

-Van




Re: Upside Down Fu character

2012-01-09 Thread vanisaac
From: Asmus Freytag 
> On 1/8/2012 1:41 PM, Doug Ewell wrote:
> > I think "if this were encoded, I think people might want to use it" was 
> explicitly not a reason to encode something.
> 
> Doug,
> 
> I think you are possibly overstating this slightly.
> 
> As often quoted, it's a maxim intended to guard against encoding
> characters for which there is no practical need (and which, perhaps,
> only the proponent wishes to use as characters, while other users tend
> to not use it in text, use graphics, etc.).
> 
> In particular, it seems to apply best in situation where it is the
> *only* argument made in favor of encoding something.
> 
> However, there are many situations, even involving things that are clearly 
> legitimate characters, where the following, almost identical statement
> turns out to hold:
> 
> "if this were encoded, I think *more* people might want to use it" (or
> "will use it")
> 
> Restated in this manner, it's just a truism, therefore neither an
> argument for or against encoding something.
> 
> As presented below the argument appears to actually be something more like:
> 
> "if this were encoded, I think people would use it in electronic data,
> not just print, handwriting, etc."
> 
> On the face of it, the statement isn't that far different from the
> earlier lines. However, instead of being a warning against encoding,
> it's one of the standard rationales for it: if an entity exists in
> traditional forms of text, but not digital data, then the lack of
> encoding is a plausible explanation for that fact, and encoding the
> character would allow Unicode to cover such textual context.
> 
> I have no opinion on the Upside-down FU ideograph as a candidate for
> encoding, but I think any analysis of its merits needs to be more
> nuanced than what your message seemed to imply.
> 
> A./

While I generally agree with your more nuanced view on this matter, Asmus, I'm 
afraid I have to disagree in this particular case. The upside down Fu has been 
used decoratively for a thousand years (it's a Chinese pun), and if anyone 
wanted to use it in plain text, they would have by now. With a character of 
such antiquity, there really is no question of computer technology suppressing 
its use. Put simply, people have either used this character in plain text, or 
they haven't. If someone can dig up a couple of example texts, then there's no 
question. If nobody can find those example texts, I think that speaks volumes 
about the utility of the character and its suitability for encoding.

-Van




Re: Upside Down Fu character

2012-01-04 Thread vanisaac
From: Otto Stolz 

> Hello,
> 
> I have tested the textUpsideDown definition from
> 
> with three browsers:
>   Firefox 8.0.1,
>   Opera 10.52;
>   Internet Explorer 8.0.6001.187702
> The latter asks for the user’s consent to interpret scripts,
> before it applies the .txtUpsideDown class definition.
> Cf. attached source file, and attached screen shot for the results.
> 
> As  already has observed,
> the upside-down text either appears below the current line,
> or overlaps other text in the current line. My test case
> shows, that in the former case, the upside-down text will
> overlap the following line (in this test a horizontal ruler).
> 
> Best wishes,
>Otto Stolz 

I think I may have it figured out, at least part way. The code

.txtUpsideDown
{
  filter: progid:DXImageTransform.Microsoft.BasicImage(rotation=2);        /* IE6, IE7 */
  -ms-filter: "progid:DXImageTransform.Microsoft.BasicImage(rotation=2)";  /* IE8 */
  -moz-transform: rotate(-180deg);     /* FF3.5+ */
  -o-transform: rotate(-180deg);       /* Opera 10.5 */
  -webkit-transform: rotate(-180deg);  /* Safari 3.1+, Chrome */
  position: relative;
}

<span class="txtUpsideDown">福福福</span>

seems to work - setting "position: relative" instead of absolute and using 
<span> instead of <div>. It works in Firefox, although when I run it on IE8, it 
doesn't work for some reason. I'm still trying to work out the bugs.

-Van




Fw: Upside Down Fu character

2012-01-04 Thread vanisaac
From: philip chastney 

> From: Michael Everson 
> Rick McGowan wrote:
>> I would say to use higher level mark-up or images for this. I don't see any 
>> reason to start down the road of encoding upside down Chinese characters, or 
>> variation sequences, for such things. They are decorative anomalies, not 
>> plain 
>> text. What's the inline markup for "display this glyph upside down"? 
>>

> this will do the job, though whether it meets the requirements of a 
> non-programmer, I don't know: 
> http://www.codeproject.com/Tips/166266/Making-Text-Upside-down-u 
> it looks to the user like straightforward CSS, but needs 
> maintaining as OSs and browsers shift and change /phil 

I tried out this code; it's simple HTML/CSS, but it doesn't seem to do a good 
job of rendering in-line text. Specifically, with the test text "福福福", 
wrapping a 福 in a <div> with the class renders the two right-side-up 福 
characters on one line, with the upside-down 福 on the next line. OTOH, using 
a <span> renders the right-side-up and upside-down 福 characters overlapping 
each other at the top left of the page, requiring several NBSP to separate the 
two characters. I don't know enough CSS to tweak the code (OK, I don't really 
know any, but I can follow simple code in most any language), but with some 
work, this could probably be the basis of doing in-line upside down text.

Test of the above code (ignore if this is gibberish):

.txtUpsideDown
{
filter: progid:DXImageTransform.Microsoft.BasicImage(rotation=2);  /* IE6, IE7 */
-ms-filter: "progid:DXImageTransform.Microsoft.BasicImage(rotation=2)";  /* IE8 */
-moz-transform: rotate(-180deg);  /* FF3.5+ */
-o-transform: rotate(-180deg);  /* Opera 10.5 */
-webkit-transform: rotate(-180deg);  /* Safari 3.1+, Chrome */
position: absolute;
size=24pt.;
font=SimHei;
}

福福福

Van Anderson



Re: tips on writing character proposal

2011-11-09 Thread vanisaac
From: Larson, Timothy E. 

> Hello!
> 
> I'm new here, but have already read some of the online documentation for 
> proposing new characters. I'm still a bit unsure how to go about it. Or even 
> who can do it. Can individuals submit ideas, or do you need to be the 
> representative of some agency or group? How much supporting background 
> information is deemed sufficient? Where do I find details (more than just the 
> pipeline table) of current pending proposals?

You absolutely do not need to be a representative of any company, government, 
organization, or group. I am in no way associated with any such entity, and I 
successfully proposed a script with ~150 characters. All it takes is a 
dedication to serious research, a large amount of time to devote to the 
process, and the tenacity and perseverance to see a long and arduous process 
through to the end. The ability to produce PDFs is also helpful, but not 
necessary.

You can take a look at a large number of proposal documents from June by 
following links at the document register, 
http://std.dkuug.dk/JTC1/SC2/WG2/docs/n4000.pdf . Note that many of the 
documents are commentaries, opinions, or 
discussions of proposals. Look for any documents called something like 
"Proposal to encode X" or "Preliminary Proposal to encode X". Note that 
preliminary proposals will necessarily be incomplete.

[snip]

> Thank you,
> Tim 

You're welcome,
Van




Re: N4106

2011-11-07 Thread vanisaac
From: Kent Karlsson 
On 2011-11-05 04:23, "António Martins-Tuválkin" wrote:

> > I'm going through N4106 ( http://std.dkuug.dk/jtc1/sc2/wg2/docs/n4106.pdf ),
> ...
> 
> I see the following characters being put forward for proposing to be
> encoded:
> 
> 1ABB COMBINING PARENTHESES ABOVE
> 1ABC COMBINING DOUBLE PARENTHESES ABOVE
> 1ABD COMBINING PARENTHESES BELOW
> 1ABE COMBINING PARENTHESES OVERLAY
> 
> Well, COMBINING DOUBLE PARENTHESES ABOVE seems to be the same as <COMBINING 
> PARENTHESES ABOVE, COMBINING PARENTHESES ABOVE>. And COMBINING PARENTHESES 
> OVERLAY seems to be just a tiny parenthesis before and a tiny parenthesis 
> after; no need for a combining mark, especially one with a splitting 
> behaviour.
> 
> Otherwise, I think COMBINING ((DOUBLE)) PARENTHESES ABOVE/BELOW are an 
> entirely new brand of characters in Unicode (if accepted as proposed). They 
> are supposed to split (ok, we have split vowels in some Indic scripts, more 
> on that below), but these split around *another combining mark*. So despite 
> being given (as proposed) vanilla above/below mark properties, they do not 
> "stack" the way such characters normally do, but are supposed to invoke an 
> entirely new behaviour.

I agree, except that if we give them anything but ccc=220/230, then canonical 
reordering will separate them from the modifier letters that they are attached 
to. I think this is one of those cases where a definition needs to expand in 
order to accommodate the architecture. We already have some non-stacking 
behaviour defined for these kinds of characters in order to accommodate 
polytonic Greek, so we do have some experience with disparate appearances of 
consecutive marks.
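
The reordering hazard is easy to demonstrate: marks with unequal nonzero 
combining classes get sorted by ccc under normalization (a quick Python check):

import unicodedata

s = 'a\u0301\u0323'  # a + acute (ccc=230) + dot below (ccc=220)
print([f'{ord(c):04X}' for c in unicodedata.normalize('NFD', s)])
# ['0061', '0323', '0301'] - the dot below (220) now precedes the acute (230)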

> That supposedly stacking combining marks *sometimes* (more a font dependence
> than a character
> dependence) don't stack but instead are laid out linearly is not new. But to
> *require* non-stacking
> behaviour for certain characters is new.

Then think of it as the "non-spacing" version of stacking behaviour.

> So we have a combination of:
> 
> 1. Splitting. (Normally only used for some Indic scripts).
> 
> 2. Indeed splitting with no other characters to use for the decomposition, 
>    thus requiring the use of PUA characters, to stay compliant, for 
>    representing the result of the split at the character level. 
>    (This is entirely new, as far as I can tell.)

I cannot imagine in any way how this requires PUA characters.

> 3. The split is entirely *within* the sequence of combining characters 
>    (except for COMBINING PARENTHESES OVERLAY, which behaves as split vowels 
>    normally do, but still with issue 2), not around the combining sequence 
>    including the base. (This is entirely new.)
> 
> 4. Requiring (if at all supported) to use linear layout of combining 
>    characters instead of stacking. (This is entirely new.)

If I were designing a font, I would simply make the in/out mark attachment 
point near the top/middle of the parentheses, so that it drops down around the 
"base" mark, and then attaches any subsequent marks as if the parentheses 
weren't there. I think you're making this too complicated.

> This makes these proposed characters entirely unique in their display
> behaviour, IMO.

I do, however, agree totally with this assessment, I just believe it is more 
manageable than you paint it.

[snip]
> /Kent K 

I do, myself, have a couple of concerns in regards to several proposed 
characters in N4106 as well. Namely, I believe that U+1DF2, U+1DF3, and U+1DF4 
should require significant justification as to why they should not be encoded 
as U+0363 + U+0308, U+0366 + U+0308, and U+0367 + U+0308. I have similar 
concerns about U+A799, U+AB30, U+AB33, U+AB38, U+AB3E, U+AB3F, etc.

Van A




Re: Private Use Variation Selectors (was: RTL PUA?)

2011-08-19 Thread vanisaac
From: Michael Everson  
> On 19 Aug 2011, at 15:51, Shriramana Sharma wrote: 
> 
>> On 08/19/2011 08:11 PM, vanisaac_at_boil.afraid.org wrote: 
>>> why there weren't private use Variation Selectors. 
>> 
>> Because you are already free to use PUA codepoints as VSs? 
> 
> Because the existing VSs are sufficient? 
> 
> Michael Everson * http://www.evertype.com/

Quote from 16.4: Standardized variation sequences are defined in the file 
StandardizedVariants.txt in the Unicode Character Database. Ideographic 
variation sequences are defined by the registration process defined in Unicode 
Technical Standard #37, “Unicode Ideographic Variation Database,” and are
listed in the Ideographic Variation Database. Only those two types of variation
sequences are sanctioned for use by conformant implementations. In all other 
cases, use of a variation selector character does not change the visual 
appearance of the preceding base character from what it would have had in the 
absence of the variation selector.

This seems not to allow for the use of Variation Selectors by private compact. 
Am I missing something here?

-Van




Re: RTL PUA?

2011-08-19 Thread vanisaac
From: Michael Everson  

> On 19 Aug 2011, at 14:29, Petr Tomasek wrote: 
> 
> > I would like to ask why there are no PUA parts which would be reserved for 
> > RTL scripts (i.e. would have the directionality set to "strong RTL"). 
> > 
> > Thanks! 
> > 
> > P.T. 
> 
> This is a very good question. 
> 
> Michael Everson * http://www.evertype.com/ 

And maybe this is a slightly more involved conversation, but I've always 
wondered why there weren't private use Variation Selectors. I'm thinking of all 
those old Byzantine Greek fonts that have a dozen different omicron+sigma 
ligatures, and each one has to be allocated to a random PUA code point where it 
loses its semantics, rather than Omicron + ZWJ + Sigma + PVS-X, where you can 
still search for Omicron+Sigma and find the thing. Just a thought.

-Van




Re: Obelus and Metobelus characters?

2011-08-12 Thread vanisaac
From: Petr Tomasek  

> Hello! 
> 
> I'm trying to find out if there are "somwhere in Unicode" the 
> characters used in ancient greek textual criticism, the so-called 
> "obelus" and "metobelus signs". See this: 
> 
>   http://www.etf.cuni.cz/~tomasek/obl.jpg 
> 
> I found following signs, none of them seems to be exactly 
> what I'm looking for: 
> 
>  U+070B SYRIAC HARKLEAN OBELUS 
>  U+070C SYRIAC HARKLEAN METOBELUS 
> 
> The obelus seems like to be the right shape, but it is a RTL 
> character. I further found these suggested to have the 
> function of obelus, but they have quite different shapes: 
> 
>  U+2020 DAGGER 
>  U+00F7 DIVISION SIGN 
> 
> My question is thus: have I missed something? 
> 
> Thank You! 
> 
> P.T. 
> -- 
> Petr Tomasek 

U+2E13 - Dotted Obelos and
U+2E14 - Downwards Ancora

-Van Anderson




Re: How is NBH (U0083) Implemented?

2011-08-03 Thread vanisaac
From: Jukka K. Korpela  
> vanisaac_at_boil.afraid.org wrote: 
> 
> > Actually, ZWNBSP is no longer suggested since Unicode 3.2. Instead, 
> > Word joiner (U+2060) is used for simply preventing line breaks. 
> > ZWNBSP should only be used for its BOM semantic in new texts, but 
> > should still be interpreted as inhibiting line breaks. (ch. 16.2, v. 
> > 6.0) 
> 
> So in effect, ZWNBSP still means "don't break", though the standard says 
> that so does the WORD JOINER and recommends that it be used instead. In 
> practice, there is hardly any system that does not implement the ZWNBSP 
> semantics but implements the WORD JOINER semantics, but there are systems 
> that do the opposite. This makes it easy to decide which one is safer to 
> use. 
> 
> Yucca

If your assumption is that the standard isn't being implemented, what is the 
point of having the standard in the first place? If your goal is to produce 
non-conformant texts, then just ignore the standard, but don't come back here 
and whine that conformant implementations are messing up your text. If, on the 
other hand, your goal is to produce conformant texts, then you need to follow 
the standard, whether or not a particular implementation is conformant. In 
addition, producing conformant text allows you to provide bug reports to 
non-conformant developers, thus aiding in the implementation of the standard.

-Van




Re: How is NBH (U0083) Implemented?

2011-08-02 Thread vanisaac
From: Jukka K. Korpela  
> Naena Guru wrote: 
> 
> > There is also the NBSP (No-break Space: U00A0), which I think has to 
> > be mapped to the space character in fonts, that glues two letters 
> > together by a space. 
> 
> NBSP has defined semantics in Unicode, but it can be implemented in 
> different ways (it could have a glyph of its own). 
> 
> > NBH is more appropriate for use within ISO-8859-1 characters than 
> > ZWNJ, because the latter is double-byte. 
> 
> ZWNJ is not an ISO-8859-1 character at all. In Unicode, it is a control 
> character that prevents the use of a ligature. The character to prevent line 
> breaking, with no other effect, is ZERO-WIDTH NO-BREAK SPACE. 
> 
> Whether NBH works at all really depends on the application, and I would not 
> expect applications to support it in practice, irrespective of character 
> encodings. 
> 
> Yucca 

Actually, ZWNBSP has not been recommended for this since Unicode 3.2. Instead, 
WORD JOINER (U+2060) is used for simply preventing line breaks. ZWNBSP should 
only be used for its BOM semantic in new texts, but should still be interpreted 
as inhibiting line breaks. (ch. 16.2, v. 6.0)

-Van




RE: Prepending vowel exception in Lontara/Buginese script ?

2011-07-23 Thread vanisaac
From: Peter Constable 

> Van is mistaken in his understanding of OpenType Layout. There is no 
> mechanism to describe re-ordering in OpenType Layout tables in a font. That 
> must be handled by the OTL client software.
> 
> Peter

Well, crapola. So that reordering is completely controlled by Uniscribe, Peter? 
Then what are the pre-base forms/substitution OT features for? I had always 
thought that vowel reordering was their purpose.

-Van




Prepending vowel exception in Lontara/Buginese script ?

2011-07-23 Thread vanisaac
From: verdy_p  
> 
> If I look in the Unicode 6.0 charts for the Buginese script, I see that 
> vowel /e/ (U+1A19) is prepended visually on the left of the base consonant 
> to which it applies. This should mean that the vowel has to be encoded 
> illogically in texts AFTER the base consonant to which it applies.

It actually IS encoded logically, just not visually. Logically, the E comes 
after the consonant sound. That's the reality that Unicode reflects. The 
fact that the script prepends this vowel mark before the consonant doesn't 
change that it logically comes after the consonant.

> However, I have tested all fonts available on the web for this script, and 
> none of them contain the necessary OpenType substitution feature needed to 
> make the logical-to-visual reordering. 
> 
> Is this a bug of these fonts (most of them are TrueType only, not OpenType 
> with a reordering feature like those used in other Indic scripts, but built 
> like basic TrueType fonts for Thai, Lao and Tai Viet scripts, that are the 
> only scripts for which Unicode has defined the "Prepended Vowel" exception)?
> 
> Or is it a bug/limitation of text renderers? 

If a script has prepended vowels, the fonts should have OpenType features 
enabled. It is absolutely a bug with the fonts.

> I note for example that Chrome correctly uses Unicode 6.0 default grapheme
>  cluster boundaries, when editing and selecting in Lontara text (written in 
> Buginese or Makassar languages), so that the vowel will be selected/deleted 
> logically along with the base character encoded before it (for example a 
> space or punctuation, or even a HTML syntax character). But if I use this 
> browser to display Lontara text, the vowel /e/ is still shown with the 
> diacritic on the right of the base consonant (or dotted circle symbol), 
> meaning that the text is garbled when I use any one of those available fonts.
> 
> All texts in Makassar or Buginese I have found, encoded in Unicode, seem to 
> assume the visual order (i.e. the same "prepended vowel" exception as in 
> Thai and Lao). Given the geographical area where the Lontara script is
> mostly used (Indonesia and Thailand), it seems quite logical that text 
> authors assumed this exception to the logical encoding order. 
> 
> What can be done? Should the fonts be corrected to include the OpenType 
> feature,

Yes.

>  or should Unicode be modified to inclide the "prepended vowel" 
> exception

No.

> also for Buginese, and so the default grapheme boundaries modified 
> as well, and the Unicode 6.0 chart modified too for U+1A19 ?

No.

> -- Philippe. 

Van




Re: ? Reasonable to propose stability policy on numeric type = decimal

2010-07-27 Thread vanisaac
From: Kenneth Whistler (k...@sybase.com)

> C. E. Whitehead said: 
> 
> > I've not gone through many character charts though so I can't 
> > really speak as an expert as you all can; sorry I've not gotten 
> > to more; I will try to ... 
> 
> For people who wish to pursue this issue further, the relevant 
> information is neatly summarized in the extracted property 
> data file: 
> 
> http://www.unicode.org/Public/UNIDATA/extracted/DerivedNumericType.txt 
> 
> That is what you should look at for efficiency, and 
> is basically what the UTC would be using for discussion 
> about this matter. 
> 
> --Ken 

C.E.

Specifically, notice the New Tai Lue digits (U+19D0-U+19DA). We have a sequence 
of eleven gc=Nd characters that absolutely cannot be arranged so that 
consecutive code points have ascending numeric values. I doubt that, if Arabic 
were encoded today, there would be a full set of Eastern digits - only 4-7, 
with 0-3 and 8 & 9 sharing with the regular Arabic digits. This leads me to the 
conclusion that any formal policy is inviting definitionally insoluble problems 
in future encodings - a collision between encoding each character only once and 
having a mathematically pure digit sequence.

That having been said, I have absolutely no problem with reserving a code point 
for zero, especially when a script is still in current use by a modern language 
community. Even if usage has not been place-value before, it is a simple 
adaptation for a script when its user community is exposed to global business, 
scientific, and standards communities.

Even though I have no official say, as a script encoder my vote would be to 
simply recommend that decimal digits be sequentially ordered 0-9, and to leave 
a reserved code point for zero if the system is in modern use but does not 
currently use place-value notation, and hence has no digit zero. I would 
explicitly fight against anything more formal, as it would unnecessarily 
encumber script encoders, who have to balance a lot more interests than just 
those of programmers who won't provide an exception branch for non-sequential 
number arrangements. You've gotta do it anyway, for CJK and New Tai Lue. I 
would also question any programmer who wouldn't allow for mixing of the two 
blocks of Arabic digits. Just leave the code open for future additions, just as 
you do for the sequential/ascending numbers.
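
Property-based conversion already copes with all of these cases without 
assuming contiguous blocks; a quick sketch with Python's built-ins:

import unicodedata

print(unicodedata.digit('\u0664'))  # 4 - ARABIC-INDIC DIGIT FOUR
print(unicodedata.digit('\u19D4'))  # 4 - NEW TAI LUE DIGIT FOUR
print(int('\u0664\u06F2'))          # 42 - int() accepts any gc=Nd digits,
                                    # even the two Arabic blocks mixed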

-Van Anderson




re: Using Combining Double Breve and expressing characters perhaps as if struck out.

2010-07-24 Thread vanisaac
Guys, does nobody read the bloody Standard anymore!?

You CAN currently add a diacritic on top of a double diacritic. The "other" 
base character is called the Combining Grapheme Joiner (U+034F).

From V. 5.0, ch. 7.9:

Occasionally one runs across orthographic conventions that use a dot, an acute 
accent, or other simple diacritic above a ligature tie - that is, U+0361 
Combining Double Inverted Breve. Because of the considerations of canonical 
ordering [...], one cannot represent such text simply by putting a combining 
dot above or combining acute directly after U+0361 in the text. Instead, the 
recommended way of representing such text is to place U+034F Combining Grapheme 
Joiner (CGJ) between the ligature tie and the combining mark that follows it, as

  <0075, 0361, 034F, 0301, 0069>.

Because CGJ has a combining class of zero, it blocks reordering of the double 
diacritic to follow the second combining mark in canonical order. The sequence 
of <CGJ, acute> is then rendered with default stacking, placing it centered 
above the ligature tie. This convention can be used to create similar effects 
with combining marks above other double diacritics (or below double diacritics 
that render below base characters).
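
That blocking behaviour is directly observable under normalization - a small 
Python check (U+0361 has ccc=234, the acute ccc=230, CGJ ccc=0):

import unicodedata

without_cgj = 'u\u0361\u0301i'        # u + tie + acute + i
with_cgj    = 'u\u0361\u034F\u0301i'  # u + tie + CGJ + acute + i

print([f'{ord(c):04X}' for c in unicodedata.normalize('NFC', without_cgj)])
# ['00FA', '0361', '0069'] - the acute reordered past the tie and fused into ú
print([f'{ord(c):04X}' for c in unicodedata.normalize('NFC', with_cgj)])
# ['0075', '0361', '034F', '0301', '0069'] - CGJ blocked the reordering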


"Philippe Verdy" wrote: 
First encode each base (unjoined) extended grapheme clusters 
separately (possibly with their own diacritics or extenders or 
prependers, including ZWJ and ZWNJ, according to their definition in 
the UAX defining text segmentations). 


Then encode the double diacritic between them. 


So for your examples you get <006F, 035D, 006F> (double breve) or 
<006F, 035E, 006F> (double macron). 


Double diacritics have a combining property equal to zero, so they 
block the reordering for canonical equivalences and the relative order 
and independance for the encoding of base grapheme clusters will be 
preserved during normalizations. 


As a consequence, if there's another diacritic added on top of the 
double diacritic, it can only be added at end of this sequence, but 
the bad thing is that it will appear just after the encoding of the 
second base grapheme cluster, and so it is subject to reordering, as 
it will be interpreted as being part itself of the second grapheme 
clusters. 


Currently you cannot add another diacritic on top of a double 
diacritic, we lack something for blocking such interpretation in the 
second cluster. 


To do that, we would need another base character with combining 
property 0 (blocking canonical reorderings), and that would have the 
same grouping semantic as other double diacritics. This character 
would be abstract (and invisible by itself) and could be something 
like: 


  U+xyzt DOUBLE DIACRITIC HOLDER 


For example to add an acute accent above the double breve joining the 
two letters 'o', we would encode: 


  <006F, 035D, 006F, xyzt, 0301> 


instead of just <006F, 035D, 006F, 0301> which is canonically 
equivalent to <006F, 035D, 00F3> and which encodes the letter 'o' and 
the letter 'o' with an acute accent (centered on this second o) joined 
with the double breve *above* the acute accent of the second 'o'. 


My opinion is that such double diacritic holder exists: it's ZWJ, 
which could be safely used as the needed invisible base for additional 
diacritics occuring on top (and centered) of a double diacritic. But 
others may have other preferences about the choice of this character. 


I don't know if ZWJ has been specified so that it could occur only 
before a "defective" combining sequence containing only combining 
diacritics. for this case, this would mean that the semantic of the 
combining diacritics encoded after it must apply to the full part of 
the extended grapheme cluster encoded before it. 


This use of ZWJ effectively allows the interpretation of the encoded 
sequence as if it was in TeX syntax: 


  \acute{ \breve{oo} } 


Without the ZWJ, it would be interpreted as: 


  \breve{ o\acute{o} } 


The double diacritics or just intended to be used between each base 
grapheme clusters to join. And it could possibly be used to groop more 
than 2 base grapheme, for example with 3 'o' as: 


  <006F, 035D, 006F, 035D, 006F> 


interpreted in TeX syntax as: \breve{ooo} 


But even with this case, you wont be able to encode with the ZWJ trick 
in plain text, such groupings that are expressed this way in TeX: 


  \breve{ \breve{oo} x \breve{ o\acute{o} } } 


Because double diacritics encoded in Unicode can't be safely stacked 
together (for such application you'll need a rich-text layer on top of 
Unicode, such as TeX here). 


Philippe. 



verdy_p (verd...@wanadoo.fr) wrote:


I just thought about a solution to allow stacking of double-diacritics: we 
could use variation selectors after them, 
to specify a higher level of grouping. 


So in the example above: 
- "\breve{o

RE: Indian Rupee Sign (U+20B9) proposal

2010-07-20 Thread vanisaac
From: CE Whitehead (cewcat...@hotmail.com)
> For my two cents, 
> I agree with Asmus and Van; pre-implement the final code point; 
> thanks also Michael for separating the glyph from the rupee symbol. 
> Best, 
> C. E. Whitehead 
> cewcat...@hotmail.com 

Gah! I didn't say that! I just think that v. 6.0.x should be released directly 
after 




RE: Indian Rupee Sign (U+20B9) proposal

2010-07-20 Thread vanisaac
> For my two cents, 
> I agree with Asmus and Van; pre-implement the final code point; 
> thanks also Michael for separating the glyph from the rupee symbol. 
> Best, 
> C. E. Whitehead 
> cewcat...@hotmail.com 

I didn't say that. All I suggest is releasing v. 6.0.x, preferably the day that 
WG2 approves the new Rupee sign, or as soon thereafter as a new code chart can 
be uploaded. I don't necessarily think that font vendors should map to the new 
code point before then, and definitely not before Redmond. Asmus may have a bit 
more relaxed attitude, borne of experience, but I actually take the conformance 
rules fairly strictly.

-Van

PS, sorry for the accidental partial post.




Re: Latin Script

2010-06-14 Thread vanisaac
From: Tulasi 

> Thanks for the input Edward!
> Yep, I shall explore time-chronology as well.
> 
> Edward -> Close, but not quite. Consider LATIN SMALL LETTER PHI (ɸ).

Amazingly, I consider Latin Small Letter Phi to be a part of the Latin script. 
Why? Because in my typographic life I would design it differently from Greek 
Small Letter Phi. The Greek phi needs to work with other Greek letters. The 
Latin phi needs to work in phonetic notation, which is Latin letters; it needs 
more contrast with Latin Small Letter Q than the Greek phi has, so it has an 
ascender. As a Classicist, I find that a Greek phi with an ascender interrupts 
the flow of text, unless in a slant font, so it is designed quite differently 
from Latin Small Letter Phi. It's just like Cyrillic Dze and Sha: borrowed from 
Latin and Coptic, they are designed and act like Cyrillic letters.

> Mark gave a new link of letter/symbol that has LATIN (thanks Mark!):
> Mark -> 
> http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:script=Latn:]&g=age
> 
> Now, how many letters/symbols in that link are like "LATIN SMALL
> LETTER PHI (ɸ)", i.e., not from Latin-script?

There's really no way to make any sort of distinction like that. Do you want to 
consider Y and Z not Latin letters because they were borrowed from Greek, not 
adapted from Etruscan? How about Þ and Wynn? They are from Runic. Should 
U+019B, Latin Small Letter Lambda with Stroke, be considered not Latin, even 
though it is not found in any other script? There are a number of these, and 
the only classification that is not completely arbitrary is to consider them 
ALL to be part of the Latin script, including Latin Small Letter Phi.

> Also, how do I find the list of letters/symbols that do not have LATIN
> in names but from Latin-script?

The Spacing Modifier Letters and Combining Diacritical Marks may also need to 
be included for a really comprehensive list; these are contained in their own 
blocks, along with the Phonetic Extensions and Phonetic Extensions Supplement. 
Then the question is whether you should include Devanagari Om. What about 
currency signs? Punctuation? Should it simply be the union of Script=Common and 
Script=Latin? Script=Common includes punctuation from all languages, so you end 
up with dandas and Arabic commas - is that right? The question really only 
makes sense if it has context: for what purpose are you defining something as 
Latin script?
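
In practice the nearest thing to "Latin script plus everything Latin text 
shares" is a property query; a hedged sketch using the third-party regex 
module (unlike the stdlib re, it understands script properties):

import regex  # pip install regex

latinish = regex.compile(r'[\p{Script=Latin}\p{Script=Common}\p{Script=Inherited}]')

for ch in ('A', '\u0278', '\u0301', '?', '\u03A9'):
    print(f'U+{ord(ch):04X}', bool(latinish.match(ch)))
# A, ɸ (U+0278), U+0301, and ? all match; Ω (U+03A9, Script=Greek) does not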

> Tulasi

Van




Re: Writing a proposal for an unusual script: SignWriting

2010-06-11 Thread vanisaac
I've already sent specific comments to Steve, but a few items that I would like 
put on the record are as follows.

From: steve (slevin...@gmail.com)
> Greetings Unicode List, 
> 
> I'm working on a character encoding model for SignWriting. I just 
> finished the 3rd major revision. Instead of needing an entire plane, I 
> only need 1280 code points. Since 1024 code points have already been 
> tentatively reserved for SignWriting on the SMP (1D800 thru 1DBFF), I'm very 
> happy with my latest update. 
> 
> SignWriting is an unusual script because it does not follow the same 
> rules as other script. I do not believe SignWriting can be changed to 
> follow the same rules without breaking the script. 

I disagree. In the end, Sutton SignWriting has a logical order, and the 
elements combine together in predictable ways. In some ways, it calls for a 
somewhat more abstract encoding model than other scripts, but it has text flow, 
recurring elements, and rules for proper syntax. Having experience with 
shorthands - another set of writing systems in which position may need to be 
defined - I can say that the character encoding model is more powerful than you 
can probably see right now. It certainly was for me.

> Before discussing the primary difference, I'd like to stipulate three 
> assumptions. First, sign languages are real human languages. Second, 
> sign language can be written. Third, SignWriting can write sign 
> language. 

I think anyone with even a cursory knowledge of the subject matter would agree 
wholeheartedly.

> We have an international user base. People write by hand or computer. 
> We have tens of thousands of signs in different sign languages from 
> around the world. We have hundreds of documents including "The Cat in 
> the Hat" (translated with permission), whole books of the Bible, and 
> others. All of this writing can be encoded using 1280 code points. I 
> have a 12-bit encoding with bi-directional conversion with UTF-8 working 
> for planes 1, 15, or 16. I'm currently using plane 15: fd800 thru 
> fdcff. 

> The primary difference between SignWriting and other scripts is that 
> SignWriting is a spatial script. The graphemes of SignWriting are not 
> written sequentially and do not have definable attachment points. 
> Imagine a two dimensional canvas. The graphemes can be written anywhere 
> on the canvas. A completed sign (or word) will consist of several 
> graphemes spatially arranged on a canvas. There is an infinite number 
> of signs that can be created. 

Even though script elements can be written on the canvas anywhere, there are a 
limited number of /relative/ positions in which given elements can appear. 
Furthermore, from what I understand, there is a definite abstract space in 
which the elements are defined, and appearing in which defines an element in 
relation to other elements. I also take exception to the contention that there 
are an infinite number of signs that can be created: it may be many millions, 
but there is most definitely a finite number of complete signs that can be 
defined.

> A spatial script requires a coordinate system, either cartesian or 
> polar. I do not believe Unicode currently includes any spatial scripts, 
> but it is impossible to use SignWriting without a coordinate system. 

This is the big whopper that I vehemently disagree with. It may be handy to 
just define placement with coordinates, but a proper script encoding will only 
define those elements that are contrastive and salient. For signwriting, there 
will undoubtedly be numerous relative placements for hand elements (over the 
head, beside the face, chest height, wide, forward, waist height, opposite 
side, etc), but it would be truly sad if we were forced by lack of imagination 
to settle for a coordinate system.

> The character encoding model is called Binary SignWriting and is 
> documented online: 
> http://www.signbank.org/bsw 

I'm looking through it, and it seems to me to be a graphics encoding model, 
rather than a character encoding model. It will take a bit of a shift in 
thinking, but this is still a pretty good starting point.

> All of the graphemes of the script are documented and encoded. 
> http://www.signbank.org/iswa 

> I was hoping to start on the Unicode Proposal in the near future. 

> Any suggestions, comments or discussion is welcome. 

> Regards, 
> -Steve 

My best,
Van




Re: Latin Script

2010-06-07 Thread vanisaac
From: Tulasi (tulas...@gmail.com)
> How do you define Latin Script? 

Do you mean historically or pragmatically? Historically, it is an adaptation of 
the Ionian Greek (or is it Doric?), via Etruscan, for the purpose of writing 
Latin, and later extended by the addition of alternate letterforms (J, W, Þ, 
and the lower case) and diacritics to the use of western European languages and 
globally to indigenous languages in primary contact with western European 
languages that use the Latin alphabet.

Pragmatically, it is the collection of characters that are used in languages in 
conjunction with the primary collection of Roman derived letterforms as an 
alphabetic script. This means that the syllabic Fraser Lisu is not Latin 
script. Neither is Cyrillic, even though it has imported Dze and Je - the basic 
Latin alphabet does not constitute the core of Cyrillic usage.

Typographic tradition also plays a part - Greek would probably be a lot more 
ambiguous if it hadn't developed typographically among Byzantine scribes. Latin 
typography developed primarily among post-Roman and Carolingian scribal 
traditions, with offshoot blackletter and Italic scribal traditions that have 
secondary status in the modern script. Greek and Cyrillic don't share this 
history, and as such, even though they are structurally similar, they have 
evolved along different lines and constitute distinct scripts. The fact that 
you don't find languages that mix the two up is evidence of these schisms. The 
border languages choose one or the other, or they have two different 
orthographies that use each script independently of the other.

Van




RE: Greek letter "LAMDA"?

2010-06-02 Thread vanisaac
From: Kenneth Whistler (k...@sybase.com)
[snip]
> I expect that even this explanation will not satisfy those 
> who think that oddities like this should not exist in 
> character names. But that is just the nature of the 
> historical development of big standards like the Unicode 
> Standard when you have to deal with very many opinions 
> expressed by very many parties and develop consensus 
> in standards committees. You inevitably end up with 
> historical oddities. 

> --Ken 

Which is to say, every character has a story, and if you listen closely enough 
and choose to understand the characters on their own terms, you just might hear 
the story. It's probably better than anything you could have made up.

-Van




Re: A question about "user areas"

2010-06-02 Thread vanisaac
From: Doug Ewell (d...@ewellic.org)
> Van Anderson wrote:
> > Look up the Conscript Unicode Registry if you want to examine a 
> > pseudo-standardized Private Use agreement. A simple mapping table will 
> > enable you to equate your private use "standard" to the officially 
> > encoded forms of these scripts, when that time comes, if you wish to 
> > publicize - in the sense of both "enable public use" and "get the 
> > message out" - your mappings. The CSUR already has three scripts: 
> > Phaistos Disk, Deseret, and Shavian, that have migrated to the 
> > Standard, and we're waiting on Tengwar and Cirth to make the move as 
> > well; it actually seems to work quite well. 

> Coordinating the PUA code points with CSUR (and other PUA allocation 
> schemes like SIL and MUFI) would be a good idea, to reduce the risk of 
> collision. As a reminder, though, CSUR itself is for Constructed 
> Scripts, so ancient Chinese characters that are actually part of the Han 
> script probably would not belong there. 

No no no. I would not suggest anything of the sort, just that the CSUR would be 
a template for how a private use allocation can be used on an unofficial, but 
still standardized basis, and also, as you said, that avoiding overlapping the 
CSUR, SIL, and MUFI agreements would be smart. I agree totally that archaic 
Chinese characters would be inappropriate for the CSUR.

> I'm not sure how much longer we should continue to wait for Tengwar and 
> Cirth. 

I hear Michael talking about meeting with the Tolkienists every once in a 
while, so I can only assume that it is proceeding in some way.

> --
> Doug Ewell  |  Thornton, Colorado, USA  |  http://www.ewellic.org
> RFC 5645, 4645, UTN #14  |  ietf-languages @ http://is.gd/2kf0s

Van Anderson




Re: Preparing a proposal for encoding a portable interpretable object code into Unicode (from Re: IUC 34 - call for participation open until May 26)

2010-06-02 Thread vanisaac
From: William_J_G Overington (wjgo_10...@btinternet.com)
> On Tuesday 1 June 2010, John H. Jenkins  wrote: 
>   
> > First of all, as Michael says, this 
> > isn't character encoding. 
>   
> Well, it is a collection of portable interpretable object code items encoded 
> within a character encoding as if the items were characters. 

There is a monumental gap between "items encoded ... as if [they] were 
characters" and actual characters. This gap is the gap between Unicode and not 
Unicode.

> > You're not interchanging plain text. 
>   
> True, but the items are interchanged as if they were plain text items within 
> the structure of the way that plain text is interchanged. 

Lots of things are interchanged. Machine code is interchanged, Scalable Vector 
Graphics are interchanged, executables are interchanged. None of these are 
plain text. They should not be interpreted as plain text. They should not be 
displayed as plain text, except for providing a way for those who understand 
the "text" as merely a representation of bytes of data that have a non-plain 
text meaning, so they can check the data. Object code is not Unicode, it is 
something else.

> > This is essentially machine language 
> > you're writing here, and there are entirely different venues 
> > for developing this kind of thing. 
>   
> Well, it is an object code for a virtual machine rather than a machine code 
> for a virtual machine as external name links can be included. Also, it has 
> high level language style constructs of while loops and repeat loops rather 
> than the jump to an address instructions of a typical machine code. Also, it 
> is relocatable in relation to the underlying memory structure of the host 
> computer: some machine codes can be relocatable as well, so I am not claiming 
> relocatablity as a distinguishing feature from machine code, I am just 
> mentioning the relocatability feature of the portable interpretable object 
> code. 

There is no difference between a virtual machine code and a physical machine 
code to a CHARACTER encoding standard. The fact that it has a high level 
language style means nothing, absolutely nothing. C code is C code, whether it 
is encoded as ASCII, Unicode, ISCII, Big 5, Shift-JIS, or anything else. The 
details of object code are immaterial to its being fundamentally a form of 
machine language, not a character.

> > Secondly, I have virtually no idea what problem this is 
> > attempting to solve unless it's attempting to embed a text 
> > rendering engine within plain text.  If so, it's both 
> > entirely superfluous (there are already projects to provide 
> > for cross-platform support for text rendering) and woefully 
> > inadequate and underspecified.  Even if this were 
> > sufficient to be able to draw a currently unencoded script, 
> > the fact of the matter is that it doesn't allow for doing 
> > anything with the script other than drawing.  
> > (Spell-checking?  Sorting?  Text-to-speech?) 
>   
> The portable interpretable object code is intended to be a system to use to 
> program software packages to solve problems of software globalization, 
> particularly in relation to systems that use software to process text. 
>
> > Unicode and ISO/IEC 10646 are attempts to solve a basic, 
> > simply-described problem:  provide for a standardized 
> > computer representation of plain text written using existing 
> > writing systems. 
>   
> Well, that might well be the case historically,

It is the case now.

>  yet then the emoji were invented and they were encoded.

Every writing system was invented.

> The emoji existed at the time that they 
> were encoded, yet they did not exist at the time that the standards were 
> started.

Immaterial. The question is whether they ARE plain text that is used as Plain 
text.

> So, if the idea of the portable interpretable object code 
> gathers support, then maybe the defined scope of the standards will 
> become extended. 

No. Unicode encodes plain text. Period. Emoji are no different. They are 
exchanged as plain text, and act as plain text. They were not encoded before 
they were exchanged as plain text; they were only encoded ONCE they were used 
as plain text. The key phrase here, and everywhere else, is "plain text". If 
it's not plain text, it is not, has never been, and never will be germane.

> > That's it.  Any attempt to use 
> > the two to do something different is not going to fly. 
>   
> Well, I appreciate that the use of the phrase "not going to fly" is a 
> metaphor and I could use a creative writing metaphor of it soaring on 
> thermals above olive groves, yet to what exactly are you using the 
> metaphor "not going to fly" to refer please? 

You know perfectly well what it means, seeing as you speak native, colloquial 
English. You may think it's cute, but the people who have responded to you are 
serious people who have dedicated their lives to addressing the real issues of 
globalization, and it is both disrespectfu

Re: A question about "user areas"

2010-06-02 Thread vanisaac
From: jander...@talentex.co.uk
> I am brewing on some plans for making a font with glyphs for ancient 
> Chinese characters and even for some of the more "dubious" glyphs; I 
> assume that there is no standard area in the Unicode standard for 
> these; so where can I put them so they are least likely to clash with 
> others? 

It is very unwise to assume the characters do not have an allocation. The 
Supplementary Ideographic Plane (Plane 2) has around 47,000 currently encoded 
characters, almost all of which are either name characters, or archaic 
characters. You can look them up in the Radical/Stroke index at 
http://www.unicode.org/Public/5.2.0/charts/RSIndex.pdf .

If you are looking for Oracle Bone, Bronze, or Small Seal script characters, 
then you will need to wait until they are encoded in the Tertiary Ideographic 
Plane, or make use of the Private Use Areas at E000-F8FF and F0000-10FFFD 
(Planes 15 and 16). There may be some argument for mapping Oracle Bone, Bronze, 
and Seal script characters to their modern counterparts, and considering them 
merely a font choice. I don't know enough about the early forms of Chinese to 
say one way or the other.

Note that the Private Use Areas are the ONLY acceptable place to encode user 
defined characters - those that are not currently in the standard. It would be 
highly inappropriate and extremely unwise to decide to map your characters to 
the Oracle Bone range (U+3-U+317FF), based on the roadmaps, or to any other 
range outside the Private Use Areas. Look up the Conscript Unicode Registry if 
you want to examine a pseudo-standardized Private Use agreement. A simple 
mapping table will enable you to equate your private use "standard" to the 
officially encoded forms of these scripts, when that time comes, if you wish to 
publicize - in the sense of both "enable public use" and "get the message out" 
- your mappings. The CSUR already has three scripts: Phaistos Disk, Deseret, 
and Shavian, that have migrated to the Standard, and we're waiting on Tengwar 
and Cirth to make the move as well; it actually seems to work quite well.

-Van Anderson