Re: Preparing a proposal for encoding a portable interpretable object code into Unicode (from Re: IUC 34 - call for participation open until May 26)

2010-06-01 Thread John H. Jenkins
First of all, as Michael says, this isn't character encoding.  You're not 
interchanging plain text.  This is essentially machine language you're writing 
here, and there are entirely different venues for developing this kind of 
thing.  

Secondly, I have virtually no idea what problem this is attempting to solve 
unless it's attempting to embed a text rendering engine within plain text.  If 
so, it's both entirely superfluous (there are already projects to provide for 
cross-platform support for text rendering) and woefully inadequate and 
underspecified.  Even if this were sufficient to be able to draw a currently 
unencoded script, the fact of the matter is that it doesn't allow for doing 
anything with the script other than drawing.  (Spell-checking?  Sorting?  
Text-to-speech?)

Unicode and ISO/IEC 10646 are attempts to solve a basic, simply-described 
problem:  provide for a standardized computer representation of plain text 
written using existing writing systems.  That's it.  Any attempt to use the two 
to do something different is not going to fly.  Creating new writing systems, 
directly embedding language, directly embedding mathematics or machine 
language--all of these are entirely outside of Unicode's purview and WG2's 
remit.  They simply will not be adopted.

Your enthusiasm may be commendable, but you're spending your energy developing 
something which is not appropriate for inclusion within Unicode.





Re: Preparing a proposal for encoding a portable interpretable object code into Unicode (from Re: IUC 34 - call for participation open until May 26)

2010-06-02 Thread John H. Jenkins

On Jun 2, 2010, at 3:51 AM, William_J_G Overington wrote:

 I know of no reason to think that a person skilled in the art would be 
 unable to write an iPad app to receive a program written in the portable 
 interpretable object code arriving within a Unicode text message and then for 
 the program to run in a virtual machine within the app, displaying a 
 graphical result on the screen of the iPad. Could such an app be written 
 based on the information in the paper_draft_005.pdf document? 
  

OK, one very last note.  The answer to this question is, "No."

=
John H. Jenkins
jenk...@apple.com



Re: Preparing a proposal for encoding a portable interpretable object code into Unicode (from Re: IUC 34 - call for participation open until May 26)

2010-06-02 Thread John H. Jenkins

On Jun 2, 2010, at 3:51 AM, William_J_G Overington wrote:

 
 Unicode and ISO/IEC 10646 are attempts to solve a basic,
 simply-described problem:  provide for a standardized
 computer representation of plain text written using existing
 writing systems.
 
 Well, that might well be the case historically, yet then the emoji were 
 invented and they were encoded. The emoji existed at the time that they were 
 encoded, yet they did not exist at the time that the standards were started. 
 So, if the idea of the portable interpretable object code gathers support, 
 then maybe the defined scope of the standards will become extended.

*If* the idea of a portable, interpretable object code embedded in plain text 
garners support and actual implementation outside of Unicode itself, then yes, 
it's conceivable that the UTC might consider it.  Emoji were encoded because 
they were already widely implemented in Japanese cell phones.  If the emoji set 
had been submitted to the UTC as is *without* prior, widespread implementation, 
it would likely not have been approved.

And in any event, Unicode already included significant collections of dingbat 
and dingbat-like elements and has from the first.  Whatever one may feel about 
the merits of encoding this particular set, the fact is that there was ample 
precedent already there.  Encoding emoji did not alter the τό τί ἦν εἶναι, the 
essence, of the standard.   

 
 That's it.  Any attempt to use
 the two to do something different is not going to fly.
 
 Well, I appreciate that the use of the phrase "not going to fly" is a 
 metaphor and I could use a creative writing metaphor of it soaring on 
 thermals above olive groves, yet to what exactly are you using the metaphor 
 "not going to fly" to refer, please?

I mean that there is no chance at all that the UTC would approve this proposal 
as matters stand, and that pursuing such a concept through Unicode channels is 
a waste of everybody's time, yours not excepted. If you seriously want to get 
such a radical redefinition of plain text included in Unicode, you'll need to 
start elsewhere.  

And I don't have time myself to really comment further than I already have.

=
John H. Jenkins
jenk...@apple.com





Re: A question about user areas

2010-06-02 Thread John H. Jenkins

On Jun 2, 2010, at 3:49 AM, Vinodh Rajan wrote:

 If there are similar projects that encode Ancient Characters in PUA, maybe 
 you can co-ordinate with them, similar to the ConScript Unicode Registry.
  

There is a proposal for Old Hanzi being worked on by the IRG.  You can peruse 
the IRG's documents on the subject at their Web site, 
http://appsrv.cse.cuhk.edu.hk/~irg/.

=
John H. Jenkins
jenk...@apple.com





Re: Hexadecimal digits

2010-06-04 Thread John H. Jenkins
Unicode has Roman numerals for compatibility reasons, not for serious use as 
Roman numerals. If you *really* want to work with roman numerals, even in the 
year MMDCCLXIII AUC, use the letters, just like the Romans did.

And in any event, you're undermining your own case, because a *lot* of 
societies have used the same symbols for letters and numerals.  People learn to 
live with it, just the way we live with cough and slough, minute and minute, 
and 1750 hours and 1750 days.  This is where gematria had its start.

Sent from my iPhone

On Jun 4, 2010, at 12:39 PM, Luke-Jr l...@dashjr.org wrote:

 Unicode has Roman numerals and bar counting (base 0); why should base 16 be 
 denied unique characters?
 
 From another perspective, the English-language Arabic-numeral world came up 
 with ASCII. Unicode was created to unlimit the character set to include  
 coverage of other languages' characters. Why shouldn't a variety of numeric 
 systems also be supported?
 
 




Re: Hexadecimal digits

2010-06-04 Thread John H. Jenkins

On Jun 4, 2010, at 2:48 PM, Luke-Jr wrote:

 The computer industry already has units of 'kilobyte' and such referring to 
 powers of 1024. 
 

You mean, of course, kibibyte.  A kilobyte is 1000 bytes.  






Re: Overloading Unicode

2010-06-07 Thread John H. Jenkins

On Jun 7, 2010, at 2:48 AM, William_J_G Overington wrote:

 I am hoping to submit a document to the Unicode Technical Committee in the 
 hope that the Unicode Technical Committee will institute a Public Review.
 

I don't believe that the UTC will institute a Public Review on this proposal 
because it is so patently outside the scope of the Unicode Standard.  

 I feel that the possibility of the Unicode Technical Committee instituting 
 such a Public Review would be increased if there were support for such a 
 Public Review to take place.
 

If there were support, the possibility might be increased from 0% to 0.001%.  
But there isn't any support.  

 I feel that a Public Review conducted by the Unicode Technical Committee 
 would be a good way to decide whether to encode a portable interpretable 
 object code into Unicode.
 

Public Reviews aren't intended to help the UTC decide whether or not a 
particular proposal is within the scope of the standard.  

Nobody's stopping you from submitting a proposal, but bear in mind that nobody 
on this list has shown any support for it and you have been told repeatedly by 
a number of people that it's outside of Unicode's scope.  There is absolutely 
no chance that the UTC will do anything on this proposal other than reject it.

This really isn't the proper venue to pursue the proposal, and you're wasting 
your time by doing so.  Implement it, get support for it, get it adopted 
outside of a narrow group of supporters.  If there is a *demonstrated* problem 
that this is a *demonstrated* solution for, then *maybe* the UTC would look at 
it.  Until then, discussing the proposal here is simply tilting at windmills.  

=
John H. Jenkins
jenk...@apple.com






Re: Octal

2010-06-07 Thread John H. Jenkins
For me, the biggest advantage for octal is that you can still count easily on 
your fingers.  (And yes, I do count on my fingers.  I also still use a slide 
rule and have been known to do long division in Roman numerals.)

On Jun 5, 2010, at 11:16 AM, Jonathan Rosenne wrote:

 When I started using computers we used octal, so I suggest new characters for 
 the octal digits “0”, “1”, “2”, “3”, “4”, “5”, “6”, “7”.
  
 BTW, octal has all the benefits claimed for hexadecimal with the advantage 
 that it is much simpler.
  
 Jony
  
 From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On 
 Behalf Of Peter Constable
 Sent: Saturday, June 05, 2010 6:45 PM
 To: Unicode Discussion
 Subject: base-9 digits
  
 Can we please encode new characters for base-9 digits “0”, “1”, “2”, “3”, 
 “4”, “5”, “6”, “7”, “8”?
  
  
  
 Peter

=
John H. Jenkins
jenk...@apple.com




Re: Hexadecimal digits

2010-06-09 Thread John H. Jenkins
Both a decimal 2 and a hexadecimal 2 are an ideogram representing the abstract 
concept of two-ness, and the latter is derived typographically from the 
former (and, indeed, currently looks exactly like it).  This is comparable to a 
Chinese 二 and a Japanese 二, which we've unified.

Unicode encodes characters, not glyphs.  In order to encode a hexadecimal-2 
separately from a decimal-2, you'd have to show either that the two are, in 
fact, inherently different characters (in which case you'd better be prepared 
to separately encode the octal-2 and the duodecimal-2 et al.), or that 
widespread existing practice treats them as distinct or at least draws them 
distinctly.

(And before anybody raises the objection, nobody treats the Chinese 二 and 
Japanese 二 as distinct.  There are other sinograms which look different when 
designed for Chinese use and Japanese use and some people would like to treat 
them as distinct for that reason, but historically and in current practice, 
this is not actually done.)

Indeed, current practice universally treats decimal-0 through decimal-9 as 
hexadecimal-0 through hexadecimal-9 and letter-A/a through letter-F/f as 
hexadecimal-10 through hexadecimal-15.  That practice would have to change 
before any serious attempt at encoding hexadecimal digits would be 
considered.  And using letters for numerals has a long and distinguished 
history despite the inherent ambiguities, so there is ample precedent for the 
current practice.
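That universal practice is visible in any mainstream programming language; a minimal Python sketch (nothing Unicode-specific here, just the ordinary letters-as-digits convention):

```python
# Hexadecimal notation reuses the decimal digits 0-9 plus the ordinary
# letters A-F/a-f; no separate "hexadecimal digit" characters exist.
value = int("2F", 16)     # the letter F serves as the digit fifteen
print(value)              # 47
print(hex(255))           # 0xff -- letters double as digits 10-15
```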

Yes, this does create a chicken-and-egg problem, and whether or not this will 
have a long-term impact on the creation or adoption of new alphabets or new 
typographic practice is an interesting one.  That, however, is irrelevant to 
how Unicode does things.  

In re the tonal system specifically, I note that it uses a glyph for 
hexadecimal-10 which looks (to me, at least) identical with a glyph for 
decimal-9.  This IMHO represents a serious impediment  to the system ever being 
adopted.  I will, however, gladly be proven wrong.

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Are Unihan variant relations expected to be symmetrical?

2010-06-29 Thread John H. Jenkins
The kZVariant field has bad data in it that we haven't had time to clean up.  
It should, in theory, be symmetrical, and it should, in theory, contain only 
unifiable forms, but as you note, it doesn't.  In addition to the use of the 
source separation rule, it should also cover characters which were added to the 
standard in error.  

In any event, I'm afraid that right now it's probably best not to rely on it 
for anything.

On Jun 29, 2010, at 8:25 AM, Uriah Eisenstein wrote:

 Hi,
 To clarify my question with an example :) The character 亀 (U+4E80) is listed 
 in Unihan as a Z-variant of 龜 (U+9F9C). However, the opposite is not true. 
 Similarly, 疍 (U+758D) is listed as a semantic variant of 蛋 (U+86CB), but not 
 vice versa. From the definitions of these variant types in UAX#38, one would 
 naturally expect them to be symmetrical, and both characters to show each 
 other as variants. There are quite a few other such cases, although it does 
 appear that in most cases the relation is symmetrical.
 My reason for asking, BTW, is that I'm thinking of grouping characters which 
 are Z-variants of each other in some application, so I need to understand 
 whether Z-variants are expected to have clear cliques in which each 
 character is a Z-variant of all others.
 I realize that the semantic variant relation, at least, is based on external 
 sources and not determined by Unicode; regarding Z-variants I'm not clear. 
 I'd like to know though whether the relation is expected to be symmetrical, 
 and the above cases are to be considered errors; or there is some meaning to 
 a one-directional relation; or something else.
 On a side note, some Z-variants I've looked at seem to have very different 
 abstract shapes, in some cases looking more like simplified/traditional 
 pairs. As I said I don't know clearly how they are determined. Are they 
 supposed to be exactly those pairs which would be unified if it were not for 
 the Source Separation Rule?
 
 TIA,
 Uriah

=
John H. Jenkins
jenk...@apple.com




Re: 001B, 001D, 001C

2010-07-07 Thread John H. Jenkins
I see Escape used (or at least the esc key on my keyboard) in a lot of 
applications still as a kind of "get me out of here" key.  And it's used a lot 
by emacs as the meta key, IIRC.

On Jul 7, 2010, at 9:00 AM, Michael S. Kaplan wrote:

 Not for any terribly interesting reason, but mainly for all kinds of
 ancient features like I mention here:
 
 http://blogs.msdn.com/b/michkap/archive/2008/11/04/9037027.aspx
 
 and here:
 
 http://blogs.msdn.com/b/michkap/archive/2007/05/28/2954171.aspx
 
 Michael
 
 Hello!
 
 001B, 001D, 001C are present in some keyboard layouts. What are these
 characters used for?
 
 
 
 
 

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: Status of Unihan

2010-07-12 Thread John H. Jenkins
We hope to have it back in the next few days.

On Jul 12, 2010, at 8:34 AM, Martin Heijdra wrote:

 When will Unihan be back? It has been down for quite a while now, and there 
 are librarians for whom checking this is part of their workflow…
  
 Martin

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com




Re: ? Reasonable to propose stability policy on numeric type = decimal

2010-07-26 Thread John H. Jenkins

On Jul 24, 2010, at 7:09 PM, Michael Everson wrote:

 On 25 Jul 2010, at 02:02, Bill Poser wrote:
 
 As I said, it isn't a huge issue, but scattering the digits makes the 
 programming a bit more complex and error-prone and the programs a little 
 less efficient.
 
 But it would still *work*. So my hyperbole was not outrageous. And nobody has 
 actually scattered them.
 

The set of Chinese numerals used in decimal notation is rather spectacularly 
scattered.
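The scattering is easy to see from the character database itself; a short sketch using Python's standard-library `unicodedata` module:

```python
import unicodedata

# The Chinese numerals used in decimal notation sit at widely separated
# code points, unlike the contiguous ASCII digits 0-9.
points = {ch: ord(ch) for ch in "〇一二三"}
for ch, cp in points.items():
    print(f"U+{cp:04X}", unicodedata.numeric(ch))

# They carry Numeric_Type=Numeric rather than Decimal, so they are not
# positional decimal digits as far as the UCD is concerned:
try:
    unicodedata.decimal("二")
except ValueError:
    print("not Numeric_Type=Decimal")
```

U+3007 sits in the CJK Symbols block while U+4E00 onward are ordinary unified ideographs, which is the scattering at issue.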

(FWIW I'm in the "Yes, it's *very* useful, and yes, it's the way we should do 
it wherever possible, but no, a formal policy is probably not best" camp.)

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Indian new rupee sign

2010-07-30 Thread John H. Jenkins

On Jul 30, 2010, at 5:01 AM, William_J_G Overington wrote:

 Is there any good reason why people cannot arrange that the new symbol is 
 fully encoded into Unicode and ISO 10646 by 31 December 2010, that is, before 
 the end of the present decade, ready to use in the next decade?
 
 If there is progress over getting the encoding done, then maybe other people 
 will join in the effort and update fonts and whatever else needs updating by 
 the same date.
 

Unicode is a complex standard whose structure involves code charts, data files, 
and various standard annexes and reports.  Any change to the standard involves 
changes to at least some of these, if not all of them.  This work is done by 
several individuals scattered around the world.  Time is needed to make sure 
the changes are properly coordinated and made with due care.  

WG2 is governed by ISO rules.  ISO is a large organization and involves 
national bodies from all over the globe.  The ISO voting process involves 
several rounds in order to make sure that any objections are properly discussed 
and responded to.  Even in the age of electronic communications, this takes 
time.

And many of the people involved in both UTC and WG2 have substantial 
responsibilities in addition to character encoding work.  (Some, indeed, do the 
character encoding work on their own time.)  It's not necessarily easy for them 
to find the time to look everything over carefully.

All of this is done at a deliberate pace because experience has taught that 
inasmuch as *any* change may have unintended consequences, making even a small 
change quickly may prove to create more problems than it solves.  

Note, for example, the early adopter who simply slapped in support for the new 
rupee symbol by overlaying it on top of ` (the grave accent).  For a lot of people, that's a cool 
solution because it means that everything works *right* *now*.  The problem is 
that it breaks a lot of other things that the person in question (and his 
supporters) obviously didn't even think of, and now they've got a pile of 
unintended consequences.  

Obviously this is an important new symbol, and I'm sure that WG2 and the UTC 
will make every effort to encode it as expeditiously as possible.  As for 
exactly how long it will take, neither WG2 nor the UTC has even *met* since 
this hit the news.  While it's exciting to have the new symbol, and while one 
does want to strike while the iron is hot, ten years from now it won't have 
made much difference whether it was encoded in 2010 or 2011--unless the job got 
botched through over-haste.

Festina lente.

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Most complete (free) Chinese font?

2010-07-30 Thread John H. Jenkins
The Han Nom fonts cover everything through Extension B and look OK.  They're 
TrueType.

On Jul 30, 2010, at 1:41 PM, jander...@talentex.co.uk wrote:

 Does anybody know what the most complete, Chinese font is called? This is for 
 Linux, but I think I can use just about any format. I know about the one 
 called Unifont, which is possibly as ugly as one can make it :-) so I was 
 hoping to find something a little bit nicer.
 
 The problem I have is that there are so many holes in most of the fonts, and 
 it seems to be quite hard to judge which font is more complete. Are there any 
 tools around that could show this - perhaps something that could tell how 
 many glyphs are defined in a given interval?
 
 

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Unihan is back, but...

2010-08-03 Thread John H. Jenkins

On Aug 3, 2010, at 12:00 PM, Robert Abel wrote:

 On 2010/08/03 18:17, John H. Jenkins wrote:
 
 Thanks for the report; it's been fixed.  
 
 BTW, problems with the Unihan database should be reported via 
 http://www.unicode.org/reporting.html.  They're less likely to slip through 
 the cracks that way.
 Speaking of slipping through cracks. Are there any plans to update the 
 reference glyphs for all Han characters added after approximately Unicode 
 3.1? I filed an error report on said page some time ago and got back that 
 Unicode just didn't get around to producing them. So is there an estimate on 
 when that will be the case?
 

Alas, no.  We do still plan to do this, but we can't give any sense of when it 
will be done.

=
John H. Jenkins
jenk...@apple.com




Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-06 Thread John H. Jenkins

On Aug 6, 2010, at 3:03 AM, William_J_G Overington wrote:

 The standards organizations have a great opportunity to advance typography by 
 defining some of the Latin letter plus variation selector pairs so that 
 alternate glyphs within a font may be accessed directly from plain text.
 

This is another case of a solution in search of a problem.  It isn't Unicode's 
business to advance typography, and in any event, typesetting plain text isn't 
the path to good typography.  Other technologies, such as OpenType, AAT, and 
Graphite, *do* have the job of making good typography easy and accessible.  
And, mirabile dictu, they can already do what you are suggesting here for plain 
text.  

Unicode's responsibility is to deal with existing needs.  If it is common for 
poets to use various letter shapes at the end of words to convey some semantic 
meaning, and if they do this in their emails or tweets, or if they're 
complaining that this is something that they want to do but can't, then Unicode 
and plain text provide a proper way to help them.  

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Accessing alternate glyphs from plain text (from Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

2010-08-09 Thread John H. Jenkins

On Aug 7, 2010, at 10:40 AM, Doug Ewell wrote:

 I'd like to see an FAQ page on "What is Plain Text?" written primarily by UTC 
 officers.  That might go a long way toward resolving the differences between 
 William's interpretation of what plain text is, which people like me think is 
 too broad, and mine, which some people have said is too narrow.
 

Well, we do have http://www.unicode.org/faq/ligature_digraph.html#10 and 
related FAQs?

The basic idea is that plain text is the minimum amount of information to 
process the given language in a normal way.  FOR EXAMPLE, ALTHOUGH ENGLISH 
CAN BE WRITTEN IN ALL-CAPS, IT USUALLY ISN'T, AND DOING IT LOOKS WRONG.  We 
therefore have both upper- and lower-case letters for English.  On the other 
hand, although English *is* usually written with some facility to provide 
emphasis, different media have different ways of providing that facility 
(asterisks, underlining, italicizing), and English written without any of these 
looks perfectly fine.  

Arabic, on the other hand, absolutely must have some way of allowing for 
different letter shapes in different contexts, or it looks just wrong, so 
Arabic plain text must have facility to allow for that, either by explicitly 
having different characters for the different shapes the letters take, or by 
providing a default layout algorithm that defines them.  

Beyond rendering, there are also considerations as to the minimal amount of 
information necessary for other text-based processes, such as sorting, 
searching, and text-to-speech.

Yes, there are issues which end up being judgment calls, and it's easy to come 
up with cases where you can't really capture the full semantic intent of the 
author without what Unicode calls rich text.  My favorite example is "The 
Mouse's Tale" in _Alice in Wonderland_.  Plain text isn't intended to capture 
all the nuances of the original's semantics, but to provide at the least a very 
close approximation.

Variation selectors are intended to cover cases where more information is 
needed for rendering than is required for other processes such as searching 
(Mongolian), or cases where different user communities disagree on whether two 
forms must be unified or must be deunified.

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Accessing alternate glyphs from plain text

2010-08-11 Thread John H. Jenkins

On Aug 11, 2010, at 8:18 AM, Doug Ewell wrote:

 But to imply that because text always has a specific appearance, determining 
 the underlying plain text is an artificial process that was imposed on us by 
 computers seems wrong.  We (meaning readers of alphabetic scripts, at least 
 Latin and Cyrillic) learn to recognize letters at an early age, but quickly 
 run into additional glyphs we don't recognize, like certain cursive uppercase 
 letters (especially G and Q) and the two-tier vs. one-tier lowercase a and g. 
  Then we find out they are different forms of the same letter, and learn to 
 read them the same, and that is the essence of plain text—the underlying 
 letters behind potentially differing glyphs.
 

Just to illustrate Doug's point, suppose someone hands you a hand-written 
letter and asks you to copy it.  To what extent do you attempt to fully 
recreate the format of the original?  Most likely, you'll simply copy the 
letters and punctuation.  If the letter has some specific formatting (such as 
underlining), you may attempt to recreate that.  By and large, however, there 
would be no effort to recreate the non-paragraphing line breaks and definitely 
not any effort to recreate the original letter shapes.  Copying the letter in 
this fashion is certainly acceptable under almost all circumstances--indeed, in 
many cases it would be preferred over, say, a photocopy--and it strongly 
suggests the existence of some sort of Platonic "plain text" which is the 
essence of what was written.

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: Accessing alternate glyphs from plain text

2010-08-12 Thread John H. Jenkins
You seem to be missing a couple of important points here which Peter is 
illustrating.

First of all, what you want to do can be done with existing technology.  
There's no need to add variation selectors or other mechanisms to achieve your 
goal.

Secondly, fonts are themselves works of art, and a well-designed face will have 
a set of swashes appropriate to that face but not necessarily to another.  Simply 
saying "I want a swash here" isn't enough.  On a Mac, for example, Hoefler Text 
Italic has one swash available for the t, whereas Zapfino has three, none of 
which are like the swash Hoefler Text Italic provides, and one of which is 
inappropriate for use at the end of a line.  Most fonts won't have any, because 
swashes are usually seen as the purview of calligraphic fonts.

So what do you do?  Do you provide a variation selector for every kind of swash 
a font designer might include to make sure you get the right one?  Or do you 
just say, "Put a swash in here, I don't care what it looks like"?  Neither 
seems like a good idea.

Note, too, that Peter used swashes where you didn't ask for them.  Since we're 
trying to embody the swashing in plain text, doesn't that mean that he's 
violating what the poet was intending to say?

When you're doing real-life typography, it's really meaningless to talk about 
alternate glyph shapes without knowing what font you're working with.  

Typography is not done with plain text.  

Just to illustrate *my* point, I'm adding a PDF of four of the huge number of 
possibilities for laying out your first stanza with Zapfino on a Mac.  Which 
one did the poet intend?



Poem.pdf
Description: Adobe PDF document


On Aug 12, 2010, at 5:38 AM, William_J_G Overington wrote:

 Thank you for taking the time to produce the pdf and thank you also for 
 sharing the result.
 
 I had not known of the Gabriola font previously.
 
 I found the following page on the web.
 
 http://www.microsoft.com/typography/fonts/family.aspx?FID=372
 
 Best regards
 
 William Overington
 
 12 August 2010
 
 On Thursday 12 August 2010, Peter Constable peter...@microsoft.com wrote:
 
 See the attached PDF showing Unicode
 5.2 text set in Word 2010 using the Gabriola font with
 line-ending characters formatted with the Stylistic Set 7
 OpenType Feature. No PUA; no variation selectors. Just
 flourishing, OpenType glyphs.
 
 
 Peter
 
 
 
 

=
John H. Jenkins
jenk...@apple.com




Re: U-Source ideographs mapped to themselves

2010-08-30 Thread John H. Jenkins

On Aug 29, 2010, at 6:07 AM, Uriah Eisenstein wrote:

 Hi,
 UAX #38 (Unihan) defines the kIRG_USource field as a reference into the 
 U-source ideograph database described in UTR #45, having the form UTCn. 
 However, several CJK Compatibility Ideographs are mapped to their own code 
 point values, e.g. U+FA0C kIRG_USource U+FA0C. The formal syntax of 
 kIRG_USource allows this, but I've found no explanation as to the meaning of 
 such a mapping; there is also no such mapping from a code point to another 
 code point.
 Thanks,
 Uriah


This is being changed with the 6.0.0 release.  The U-source for all such 
ideographs has been turned into a UTR #45 index, e.g., the U-source for U+FA0C 
is now UTC00915.  

What it means is that the character is a unifiable variant derived from one of 
the industrial (and not national) sources used by Unicode during the 
development of the original URO.   
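A sketch of how the two forms can be told apart when scanning Unihan data; the sample lines imitate Unihan's tab-separated layout and, apart from the U+FA0C entry discussed above, are illustrative stand-ins rather than database excerpts:

```python
import re

# Distinguish the old self-referential kIRG_USource form (U+XXXX) from
# the UTR #45 index form (UTCnnnnn).
sample = (
    "U+FA0C\tkIRG_USource\tUTC00915\n"
    "U+4E00\tkIRG_GSource\tG0-523B\n"
)

kinds = {}
for line in sample.splitlines():
    cp, field, value = line.split("\t")
    if field != "kIRG_USource":
        continue   # only the U-source field is of interest here
    kinds[cp] = ("UTR #45 index" if re.fullmatch(r"UTC-?\d+", value)
                 else "legacy self-mapping")
    print(cp, value, kinds[cp])
```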

=
John H. Jenkins
jenk...@apple.com




Re: Unihan SQL access

2010-09-12 Thread John H. Jenkins
I'll raise the possibility with the appropriate individuals, but I think it 
likely that the Consortium would prefer that third parties not host clones of 
the Unihan database.  

On Sep 12, 2010, at 9:57 AM, Uriah Eisenstein wrote:

 Hello,
 I'm nearing completion of a simple Java program which loads Unihan data from 
 the source files into a DB, and provides SQL access to it. There's still at 
 least a week or so of work on issues I consider essential, but once ready I'd 
 be happy to make it available on the Internet if anyone's interested.
 So far I've used it to search for possibly erroneous data in Unihan; my 
 latest find is that 73 characters have a kTaiwanTelegraph value of , 
 which seems doubtful. It can also be useful for various statistical 
 information such as how many characters are listed under each radical, or 
 which blocks include IICore characters.
 I'm also considering adding the contents of the Unicode Character Database as 
 well at a later phase.
 Regards,
 Uriah Eisenstein

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com




Re: Creative people on Twitter

2010-10-14 Thread John H. Jenkins

On Oct 14, 2010, at 4:12 AM, William_J_G Overington wrote:

 What is the position regarding the 32-bit code point space above U+10FFFF 
 please?
 

Its use is incompatible with Unicode.  Fundamentally, it cannot be represented 
using UTF-16 (without a major rearchitecture), so it doesn't exist.

 Does the Unicode Consortium and/or ISO or indeed anyone else make any claims 
 upon it?
 

Yes, the claim is that if you use it, you're generating invalid Unicode.  

Don't do it, don't contemplate it, don't think about it.  
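The arithmetic behind that ceiling is simple; a Python sketch:

```python
# UTF-16 surrogate pairs address 0x400 * 0x400 supplementary code points
# starting at U+10000, so the ceiling of the code space is U+10FFFF.
max_code_point = 0x10000 + 0x400 * 0x400 - 1
print(hex(max_code_point))      # 0x10ffff

# Python enforces the same ceiling:
try:
    chr(max_code_point + 1)
    valid_beyond = True
except ValueError:
    valid_beyond = False
print(valid_beyond)             # False
```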

=
John H. Jenkins
jenk...@apple.com




Re: Errors in Unihan data : simplified/traditional variants

2010-11-01 Thread John H. Jenkins

On 2010/10/30, at 8:42 PM, Koxinga wrote:

 My quickly done parsing program counted 1154 such pairs, where the head 
 character was the same as the character above. It seems to be always in the 
 order kTraditionalVariant then kSimplifiedVariant, so can maybe be 
 automatically corrected. It seems to be a very evident mistake, and the 
 correction should be easy. I can help with that, I am just waiting to see if 
 this is the right place to report problems in Unihan. I also considered 
 http://www.unicode.org/reporting.html; would it be better?
 

Yes, that would be better.  That way it will be tracked and it's less likely to 
slip through the cracks in my schedule.  For general questions, you can email 
me directly.
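The symmetry check the original poster describes can be sketched in a few lines; the toy data below mirrors the asymmetry reported in an earlier thread (U+4E80 listing U+9F9C but not vice versa), not real Unihan contents:

```python
# Test whether a variant relation is symmetric, i.e. whether
# B in variants[A] implies A in variants[B].
variants = {
    0x4E80: {0x9F9C},   # 亀 lists 龜 as a Z-variant
    0x9F9C: set(),      # 龜 lists no variants (the reported asymmetry)
}

def asymmetric_pairs(rel):
    return sorted((a, b) for a, targets in rel.items()
                  for b in targets if a not in rel.get(b, set()))

bad = asymmetric_pairs(variants)
print([(f"U+{a:04X}", f"U+{b:04X}") for a, b in bad])
```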

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: ch ligature in a monospace font

2011-06-28 Thread John H. Jenkins

On 28 Jun, 2011, at 11:29 AM, Jean-François Colson wrote:

 * In the C’HWERTY layout on Linux, the digraph and trigraph had to be 
 replaced by six PUA characters and an input method such as xim must be used 
 to get the correct character sequences. Since they are PUA characters, those 
 substitutions are not installed by default and the user has to add them 
 him/herself in his/her ~/.XCompose file. I’ve made a bug report at 
 Freedesktop.org to ask 6 new keysyms, but I don’t know when I’ll get an 
 answer if I get one at all. If there were Unicode characters such as LJ Lj lj NJ 
 Nj nj etc. for ch and c’h, such a problem wouldn’t occur.
 

Why do you need to process them as single characters?  The typical way of 
handling these things is to use multiple characters, as is done in Welsh for 
dd, ff, and ll (among many other examples from many other languages).  
This is a well-known problem, and with modern systems there's no aspect of text 
processing that can't be handled this way. Keyboards can emit multiple 
characters with one keystroke, sorting can be tailored to account for 
multiple-character letters, and so on.  
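A toy illustration of that kind of tailoring (my own sketch: the alphabet below is abbreviated and hypothetical, and real software would use CLDR/ICU collation data rather than hand-rolling this):

```python
# Sort words treating a digraph such as "ll" as a single letter that collates
# after "l", by greedily tokenizing each word into alphabet "letters".
ALPHABET = ["a", "b", "c", "d", "e", "f", "ff", "g", "h", "i", "l", "ll",
            "m", "n", "o", "p", "r", "s", "t", "u", "w", "y"]
BY_LENGTH = sorted(ALPHABET, key=len, reverse=True)  # try digraphs first

def collation_key(word):
    key, i = [], 0
    while i < len(word):
        for letter in BY_LENGTH:
            if word.startswith(letter, i):
                key.append(ALPHABET.index(letter))
                i += len(letter)
                break
        else:
            i += 1  # ignore characters outside the toy alphabet
    return key

print(sorted(["lwc", "llan", "lamp"], key=collation_key))
# ['lamp', 'lwc', 'llan']: "llan" sorts after "lwc" because "ll" follows "l"
```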

 * Since those two letters must be encoded in 2 or 3 characters, with a 
 monospace font, they are twice or 3 times larger than the other letters.
 
 To solve this last problem, would it be possible to make a font in which c 
 ZWJ h would be displayed as a new glyph?
 

Yes, it's fairly trivial to do.  
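To be clear about what the backing store holds in that case (a sketch of my own): the text remains the three-code-point sequence, and only the font maps it to a single-width glyph.

```python
# "c" + ZERO WIDTH JOINER (U+200D) + "h": three code points in memory,
# which a suitably built font may render as one monospace ligature glyph.
seq = "c\u200dh"
print(len(seq), [f"U+{ord(c):04X}" for c in seq])
# 3 ['U+0063', 'U+200D', 'U+0068']
```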

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Quick survey of Apple symbol fonts (in context of the Wingding/Webding proposal)

2011-07-15 Thread John H. Jenkins
I'll try to arrange for an official corporate response to this document for the 
next UTC, but informally, I note that the charts include a number of variants 
of the Apple corporate logo, which Apple wants *not* to be encoded in any form. 
 

Beyond this—and speaking purely for myself and not for Apple (and unfortunately 
aware that some people don't understand or will not respect the distinction)—I 
think that this whole discussion is starting up a little too quickly.  The mere 
fact that they're in fonts some corporation ships is not evidence that they are 
appropriate even for consideration, let alone encoding, particularly in the 
absence of clones or other widely-distributed fonts which contain these glyphs. 
 I think it's fair to say that if Apple felt that these glyphs were needed in 
general text interchange, Apple would have proposed them.  

In any event, I would personally prefer that the whole discussion be dropped 
until Apple has had a chance to at least look over the document and respond.  
To do otherwise strikes me as premature at best and discourteous at the least.  

=
井作恆
John H. Jenkins







Re: Endangered Alphabets

2011-08-19 Thread John H. Jenkins
I think you want ISO 2022.  

In any event, this will never happen in Unicode, because (unless I misunderstand 
you) it is the exact opposite of what Unicode is all about.  Unicode's 
goal is for every code unit to have a fixed interpretation.  So far as many 
people involved in the original design of Unicode were concerned, code pages 
were a disaster.  

On Aug 19, 2011, at 7:14 AM, srivas sinnathurai wrote:

 PUA is not structured and not officially programmable to accommodate numerous 
 code pages.
  
 Take ISO 8859-1, -2, -3, and so on.
 These now allocate the same code points to many languages and for other 
 purposes.
 Similarly, structured and official allocations for many requirements could be 
 made using the same codes, say 16,000 of them.
  
 Sinnathurai
 
 On 19 August 2011 13:53, Doug Ewell d...@ewellic.org wrote:
 In what way is this not what the PUA is all about?
  
 --
 Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
 www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell
 From: srivas sinnathurai
 Sent: Friday, August 19, 2011 5:13
 To: Michael Everson
 Cc: unicode Unicode Discussion ; unicore UnicoRe Discussion
 Subject: Re: Endangered Alphabets
  
 It is about time we allocate a significant space within the Unicode code 
 space to work in the old-fashioned code-page provisioning mode.
  
 I'm not calling for any change to existing major allocations. However, it 
 is about time we allocate (not in the PUA) a large number of codes to code-page 
 based sub-codes, so that all 7000+ languages can freely use them 
 without INTERFERENCE from Unicode and have the freedom to carry out research 
 work, like we were doing with the legacy 8-bit codes.
  
 All those in favour of creating code pages, please say yes, and others please 
 say why not.
  
 Kind Regards
 Sinnathurai Srivas
 On 19 August 2011 10:55, Michael Everson ever...@evertype.com wrote:
 I'd like to invite everyone to support this worthwhile project:
 
 http://www.kickstarter.com/projects/1496420787/the-endangered-alphabets-project/
 
 Michael Everson * http://www.evertype.com/
 
 
 
  
 

=
井作恆
John H. Jenkins
jenk...@apple.com





Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

2011-08-19 Thread John H. Jenkins

On Aug 19, 2011, at 9:40 AM, srivas sinnathurai wrote:

 Why this suggestion?
 With current flat space, one code point is only allocated to one and only one 
 purpose.
 We can run out of code space soon.
 


There are a couple of problems here.

We currently have over 860,000 unassigned code points.  Surveys of all known 
writing systems indicate that only a small fraction of these will be needed.  
Indeed, although it looks likely that Han will spill out of the SIP into plane 
3, all non-Han will likely fit into the SMP.  (Michael, you can correct me on 
this if I'm wrong.)

Even if we allow for the possibility that there are a lot of writing systems 
out there we don't know about, there would have to be a *lot* of writing 
systems out there we don't know about to fill up planes 4 through 14.  If the 
average script requires 256 code points, there would have to be some 2800 
unencoded scripts to do that.  

Moreover, it's taken us 20 years to use 250,000 code points.  Even if that rate 
remained steady (and it's been going down), it will take us something on the 
order of a century to fill up the remaining space, if that's even possible, and 
that hardly qualifies as "soon".
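Those estimates are easy to check (a back-of-the-envelope sketch of my own, using the round numbers from the paragraphs above):

```python
# Planes 4 through 14 inclusive: 11 planes of 65,536 code points each.
planes_4_through_14 = 11 * 0x10000
print(planes_4_through_14 // 256)  # 2816 hypothetical 256-code-point scripts

# ~250,000 code points used in 20 years, ~860,000 still unassigned:
used, years, remaining = 250_000, 20, 860_000
print(remaining / (used / years))  # 68.8 more years at that (peak) rate
```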

And there already is a code page switching mechanism such as you propose.  It's 
called ISO 2022 and it supports Unicode.  

In order to get the UTC and WG2 to agree to a major architectural change such 
as you're suggesting, you'd have to have some very solid evidence that it's 
needed—not an interesting idea, not potentially useful, but seriously *needed*. 
 That's how surrogates and the astral planes came about—people came up with 
solid figures showing that 65,536 code points was not nearly enough.  So far, 
the evidence suggests that we're in no danger of running out of code points.  

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com




Re: RTL PUA?

2011-08-19 Thread John H. Jenkins

On Aug 19, 2011, at 11:15 AM, Michael Everson wrote:

 On 19 Aug 2011, at 18:01, Shriramana Sharma wrote:
 
 Even though it isn't encoded?  That is, my understanding is that we *can't* 
 change the PUA to ON now, but that there is a suggestion that some *new* 
 hunk of PUA be created that is R, in order to balance the existing L. Is 
 that right?
 
 Right, Michael is suggesting that, but since the properties of the PUA 
 characters aren't binding as said above, this is also unnecessary.
 
 Saying that does not make it possible for people to use PUA characters with 
 RTL directionality, since all the OSes treat them as LTR.
 

Mac OS has a mechanism to override that default assumption, the 'prop' table.  
And hopefully people support RLO and LRO properly, which provides a 
general-purpose mechanism.  
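A sketch of that general-purpose mechanism (mine; the PUA code points are arbitrary): wrap the run in RLO (U+202E) and PDF (U+202C), and any conformant bidi implementation will lay it out right-to-left regardless of the characters' default properties.

```python
RLO, PDF = "\u202e", "\u202c"   # RIGHT-TO-LEFT OVERRIDE, POP DIRECTIONAL FORMATTING
pua_run = "\ue000\ue001\ue002"  # three arbitrary Private Use Area characters
text = RLO + pua_run + PDF
print(len(text), [f"U+{ord(c):04X}" for c in text])
# 5 ['U+202E', 'U+E000', 'U+E001', 'U+E002', 'U+202C']
```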

 Would mean yet another chunk of space where we aren't allowed to encode 
 anything. (Yes yes I know all that about plenty of space, but that space 
 gets filled up pretty quickly. I predict/expect the SMP will be filled soon.)
 
 Put a RTL PUA zone in Plane 14, which is mostly empty, and expected to remain 
 so, and you're done. 
 

No, you're not, because the OSes/rendering engines would have to rev, and to be 
honest, there won't be a lot of enthusiasm for doing something like 
this so long as it isn't actually *required* in order to be Unicode conformant. 
(It's hard enough to get people to do the required stuff.) RTL, PUA support, 
and optional features are usually pretty low on most people's priority lists. 

I'm very sympathetic with the frustration people feel over the current 
situation, but, again, before you could convince the UTC to do this, you'd have 
to present pretty solid evidence that  the current solution doesn't work and 
that this would.  

=
John H. Jenkins
jenk...@apple.com






Re: Code pages and Unicode

2011-08-19 Thread John H. Jenkins

On Aug 19, 2011, at 3:53 PM, Benjamin M Scarborough wrote:

 Whenever somebody talks about needing 31 bits for Unicode, I always think of 
 the hypothetical situation of discovering some extraterrestrial civilization 
 and trying to add all of their writing systems to Unicode. I imagine there 
 would be little to unify outside of U+002E FULL STOP.

Oh, I imagine they'll have one or two turtle ideographs.  :-)

Seriously, though, if and when we run into ETs with all their myriad writing 
systems, I really don't think that we'll be using Unicode to represent them.

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Code pages and Unicode

2011-08-22 Thread John H. Jenkins

On Aug 20, 2011, at 2:31 AM, Christoph Päper wrote:

 Mark Davis ☕:
 
 Under the original design principles of Unicode, the goal was a bit more 
 limited; we envisioned […] a generative mechanism for infrequent CJK 
 ideographs,
 
 I'd still like having that as an option.
 


Et voilà!  We have Ideographic Description Sequences.  Or, if you're more 
ambitious, CDL.  
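A concrete illustration of the former (my own example sequence): an IDS is ordinary plain text in which an operator from the U+2FF0 block prefixes its operands, here describing the left-right composition that happens to be the already-encoded 海.

```python
# ⿰ (U+2FF0, IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT) followed by
# 氵 (U+6C35) and 每 (U+6BCF) describes the same left-right structure as
# the encoded character 海 (U+6D77).
ids = "\u2ff0\u6c35\u6bcf"
print([f"U+{ord(c):04X}" for c in ids])  # ['U+2FF0', 'U+6C35', 'U+6BCF']
```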

Generative mechanisms for Han are very attractive given the nature of the 
script, but once you try to support something other than display, or even try 
to write a rendering engine, all sorts of nasty problems crop up that have 
proven difficult to solve.  We won't even get into the problem of wanting to 
discourage people from making up new ad hoc characters for Han. 

I won't say some sort of generative mechanism will never become the preferred 
way of handling unencoded ideographs, but there is a lot of work to be done 
before that would be practical.

=
John H. Jenkins
jenk...@apple.com






Re: RTL PUA?

2011-08-22 Thread John H. Jenkins

On Aug 22, 2011, at 10:59 AM, Doug Ewell wrote:

 Petr Tomasek tomasek at etf dot cuni dot cz wrote:
 
 Some PUA properties, like glyph shapes and maybe directionality, can
 be stored in a font.  Others, like numeric values and casing, might
 not or cannot.  An interchangeable format needs to be agreed upon for
 
 Why not?
 
 Where does one store numeric values in a font?  Maybe this should be
 taken off-list.
 


This is actually a relevant point.  The major TrueType variants all work 
primarily with glyphs, not characters.  Using them as a place to store 
information about the *characters* in the text is therefore not a reliable way 
to provide an override for default system behavior.  By the time the rendering 
engine consults the fonts for layout specifics, large chunks of the text 
processing will already be completed.  

OpenType, for example, expects that the bidi algorithm is largely run in 
character space, not glyph space, and therefore without regard for the specific 
font involved.  (AAT does almost everything in glyph space, including bidi.  
I'm not sure about Graphite.)  

The net result is that a font is an unreliable way of storing 
character-specific information useful on multiple platforms.  This is one 
reason why embedding the existing directionality controls within the text 
itself is currently the most reliable way of getting the behavior one might 
want in a platform-agnostic way.

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: RTL PUA?

2011-08-22 Thread John H. Jenkins

On Aug 22, 2011, at 10:49 AM, William_J_G Overington wrote:

 In the Description section of the Macintosh Roman section of a TrueType font, 
 include a line of text in a plain text format of which the following line of 
 text is an example.
 
 PUA.RTL=$E000-$E1FF,$E440-$E447,$E541,$E549,$E57C,$EA00-$EA0F,$EC07;
 

Forgive my asking, but this reference to the description section of the 
Macintosh Roman section of a TrueType font has me puzzled, because I don't 
know what you're talking about.  What table contains this string?

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: RTL PUA?

2011-08-22 Thread John H. Jenkins

On Aug 22, 2011, at 12:36 PM, William_J_G Overington wrote:

 On Monday 22 August 2011, John H. Jenkins jenk...@apple.com wrote:
 
 Forgive my asking, but this reference to the description section of the 
 Macintosh Roman section of a TrueType font has me puzzled, because I don't 
 know what you're talking about.  What table contains this string?
 
 When I use FontCreator, made by High-Logic, http://www.high-logic.com is the 
 webspace: with a font file open, I can select Format from the menu bar and 
 then select Naming... from the drop down menu.
 
 That leads to a dialogue panel.
 
 From that dialogue panel one may select, for an ordinary, basic Unicode font, 
 either of two platforms, namely Macintosh Roman and Microsoft Unicode BMP 
 only.
 
 Having selected a platform, one may view the text content of various fields 
 for that platform, such as font family name and copyright notice, version 
 string and postscript name. There is then a button that is labelled 
 Advanced... that, if clicked, opens another dialogue panel with various other 
 text fields, including Font Designer and Description, which are the two that 
 I often use.
 
 Now, when the text values in the fields are stored in the font file, the 
 values for the Macintosh Roman platform are stored in plain text and the 
 values for the Microsoft Unicode BMP only platform are stored in some encoded 
 format.
 
 So, if one opens a TrueType font file in WordPad and one searches for an item 
 of plain text that is in one of the fields of the font, then the text that is 
 in the Macintosh platform can be found, yet the text that is in the Microsoft 
 Unicode BMP only platform cannot be found.
 
 So, I thought that if a manufacturer of a wordprocessing application or a 
 desktop publishing application decided to make a special researcher's 
 edition of the software, then that software could, when a font is selected, 
 first scan the font for a PUA.RTL string and, if one is found, override the 
 left-to-right nature of the identified characters to be a right-to-left 
 nature, just while that font is selected.
 
 Whether such a software package ever becomes available is something that only 
 time will tell, yet it seems to me that it is a method that could be used 
 without needing any changes by any committee.
 

Ah.  You're referring to an entry in the 'name' table, then.  The intention of 
the 'name' table is to provide localizable strings for the UI.  Using it to 
store data of any sort for the rendering engine would be very, very 
inappropriate.  

In general, one should not be using a text editor to examine the contents of a 
TrueType font. It would be like using a text editor to examine the contents of 
an application.  Even if you see some plain text, you really don't have any 
sense for how it's actually being used.  

You may want to bone up on the structure of TrueType/OpenType fonts.

=
John H. Jenkins
井作恆
jenk...@apple.com







Re: RTL PUA?

2011-08-23 Thread John H. Jenkins

On Aug 23, 2011, at 2:33 PM, John Hudson wrote:

 Behdad Esfahbod wrote:
 
 I can see the advantages of such an approach -- performing GSUB prior to 
 BiDi
 would enable cross-directional contextual substitutions, which are currently
 impossible -- but the existing model in which BiDi is applied to characters
 *not glyphs* isn't likely to change. Switching from processing GSUB lookups 
 in
 logical order rather than reading order would break too many things.
 
 You can't get cross-directional-run GSUB either way because  by definition
 GSUB in an RTL run runs RTL, and GSUB in an LTR run runs LTR.  If you do it
 before Bidi, you get, eg, kerning between two glyphs which end up being
 reordered far apart from eachother.  You really want GSUB to be applied on 
 the
 visual glyph string, but which direction it runs is a different issue.
 
 Kerning is GPOS, not GSUB.
 
 But generally I agree. My point was that Philippe's suggestion, although it 
 could be the basis of an alternative form of layout that might have some 
 benefits if fully worked out, is a radical departure from how OpenType works.
 

I'll toss in my obligatory "that's how AAT does it" reference.  It has 
advantages and disadvantages—but, as you say, OT would have to be heavily 
redesigned to do it.  

=
John H. Jenkins
井作恆
jenk...@apple.com







Re: RTL PUA?

2011-08-24 Thread John H. Jenkins

On Aug 23, 2011, at 9:08 PM, John Hudson wrote:

 I think you may be right that quite a lot of existing OTL functionality 
 wouldn't be affected by applying BiDi after glyph shaping: logical order and 
 resolved order are often identical in terms of GSUB input. But it is in the 
 cases where they are not identical that there needs to be a clearly defined 
 and standard way to do things on which font developers can rely. [A parallel 
 is canonical combining class ordering and GPOS mark positioning: there are 
 huge numbers of instances, even for quite complicated combinations of base 
 plus multiple marks, in which it really doesn't matter what order the marks 
 are in for the typeform to display correctly; but there are some instances in 
 which you absolutely need to have a particular mark sequence.]

And this is really the key point.  There really isn't anything inherent to 
OpenType that absolutely *requires* the bidi algorithm be run in character 
space.  It would theoretically be possible to manage things in a fashion so 
that it's run afterwards, à la AAT.  But font designers *must* know which way 
it's being done in practice, and, in practice, all OT engines run the bidi 
algorithm in character space and not in glyph space.  At this point, trying to 
arrange things so that it can be done in glyph space instead is a practical 
impossibility.

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Code pages and Unicode

2011-08-24 Thread John H. Jenkins

On Aug 23, 2011, at 2:00 PM, Asmus Freytag wrote:

 
 Until then, I find further speculation rather pointless and would love it if 
 it moved off this list (until such time).
 


That would be wonderful, because we could then turn our attention to more 
urgent subjects, such as what to do when the sun reaches its red giant stage 
and threatens to engulf the Earth. ☺ 

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: Code pages and Unicode

2011-08-24 Thread John H. Jenkins
It has ceased to be. It's expired and gone to meet its maker. It's a stiff. 
Bereft of life, it rests in peace.…Its metabolic processes are now history. 
It's off the twig. It's kicked the bucket, it's shuffled off its mortal coil, 
run down the curtain and joined the bleedin' choir invisible.  This is an 
ex-possibility.

And even if that *weren't* true, there are nowhere *near* enough kanji to have 
a serious impact on Ken's analysis.  

On Aug 24, 2011, at 4:51 PM, Richard Wordingham wrote:

 Has Japanese
 disunification been completely killed, or merely scotched?

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Controls, gliphs, flies, lemonade

2011-09-20 Thread John H. Jenkins
In re CJK, that's already a FAQ: http://www.unicode.org/faq/han_cjk.html#16.  
The short version is: if all you want to do is to draw something, then yes, 
making up new hanzi on the fly is a solvable problem.  If you want to do 
anything that deals with the *content* (lexical analysis, sorting, 
text-to-speech), it's an incredibly difficult problem.  

And, actually, there's already a way to insert nonstandard hanzi into text 
(well, two, if you count the Ideographic Variation Indicator), namely 
Ideographic Description Sequences.  They're clumsy and awkward, but they do 
make it possible to exchange text with unencoded hanzi in a vaguely standard 
fashion.  

And yes, Unicode is very complicated, but that's because of the problem it's 
intended to solve.  If all you're interested in is drawing text in a couple of 
common scripts, such as Latin and Japanese, then you really don't need Unicode 
with all of its complexity.  Unicode is trying to provide a basis for handling 
all aspects of plain text processing for all the languages of the world in a 
single application.  

Just go to Wikipedia and look down the long list of different languages that a 
popular subject has articles in.  *That* is what Unicode is trying to provide.  
It's very tough to implement, but fortunately on all the major platforms, there 
are libraries that make it unnecessary for you to do all the work yourself.

On Sep 20, 2011, at 9:01 PM, QSJN 4 UKR wrote:

 Yes, I had written 'Egyptian hieroglyphs', but how about banal CJK? We
 still have no way to insert a nonstandard ideogram into text. Isn't it
 a simple task? There are just 20 basic strokes :) ok, 500 basic
 symbols. Or 20? However, we can't combine them together :( !
 Unicode is too complex a standard. I don't even know how many properties
 one character has (did you know about Unicode-coloured characters?
 there was a thread of mine about that somewhere on this list), so how can I
 know how my application has to render 'plain' text with bidi,
 non-canonically-ordered diacritics, and Korean script. Right, I don't know
 that. And my application renders it one way, some other application another
 (a_a / aa_, a double combining character; surely you've seen that), so we
 have no standard at all.
 Of course, I could learn this complex standard, but what for? Most of
 it I would never use.
 There must be a simpler system, without so much a priori data for it to work.
 
 2011/9/13, John H. Jenkins jenk...@apple.com:
 
  On Sep 12, 2011, at 9:06 PM, QSJN 4 UKR wrote:
 
  I know it is a sacred cow, but let me just ask how you people think.
  Is it good or bad that the code point means everything about a character: what,
  where, how... (see the subject)? Maybe if we had separate graph and control
  codes we wouldn't have many problems, from banal ltr (( rtl instead of ltr
  (rtl) to placing one tilde above 3, 4, or more letters, or Egyptian
  hieroglyphs in rows and columns. Conceptually, I mean! Each letter in text
  is at least two code points (what and where) in the file. Is that stupid?
  Trying to render the text, we must generate this data anyway.
 
 
 
 It's not really a sacred cow per se, but it is a fundamental architectural
 decision which would be pretty much impossible to revisit now.
 
 Almost all writing is done using a small set of script-specific rules which
 are pretty straightforward.  English, for example, is laid out in horizontal
 lines running left-to-right and arranged top-to-bottom of the writing
 surface.  East Asian languages were traditionally laid out in vertical lines
 running from top-to-bottom and arranged right-to-left on the writing
 surface.
 
 Because some scripts are right-to-left and ltr and rtl text can be freely
 intermingled on a single line, Unicode provides plain-text directionality
 controls.  The preference, however, is to use higher-level protocols where
 possible.
 
  As for the scripts which are inherently two-dimensional (such as
 hieroglyphics, mathematics, and music), it's almost impossible to provide
 plain text support for them.  There is too much dependence on additional
 information such as the specifics of font and point size.  Because of this,
 the UTC decided long ago that layout for such scripts absolutely must be
 done using a higher-level protocol to handle all the details.
 
 There are occasionally suggestions that positioning controls be added to
 plain text in Unicode, but so far the UTC has felt that the benefits are too
 marginal to overcome its reasons for having left them out in the first
 place.
 
 =
 Hoani H. Tinikini
 John H. Jenkins
 jenk...@apple.com
 
 
 
 
 
 
 

=
John H. Jenkins
jenk...@apple.com






Re: New version of UTR #45 published

2011-10-03 Thread John H. Jenkins

On Sep 30, 2011, at 11:32 PM, Philippe Verdy wrote:

 What is the current status of the UTC's Extension E?
 - If it's still not validated, then the description of the field in
 the last paragraph quoted below should not be there, but in a pending
 update of this UTR.
 - If it's approved, then the first paragraph should list E, and there
 should not be any reference to a proposal in the paragraph
 describing it.
 

It's still being looked at by the IRG along with all the other Extension E 
submissions.  It's not a part of the standard yet, and it's subject to change, 
but it is a well-defined set of interest to people who are tracking IRG work.  

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com






Re: Unihan data for U+2B5B8 error

2011-10-19 Thread John H. Jenkins

On Oct 19, 2011, at 3:06 AM, Jukka K. Korpela wrote:

 I don’t know what issue Shi Zhao is referring to, but there is definitively 
 an error on the page. Under the heading “Glyphs,” the small table contains, 
 under the header cell “The Unicode Standard”, a cell that appears to be 
 empty. 

This is a known (and, alas, long-standing) problem.  We really do intend to get 
it fixed, but it's impossible to say when.

=
John H. Jenkins
jenk...@apple.com






Re: Unihan data for U+2B5B8 error

2011-10-19 Thread John H. Jenkins

On Oct 19, 2011, at 4:14 AM, Andrew West wrote:

 On 19 October 2011 10:43, shi zhao shiz...@gmail.com wrote:
 The page said kTraditionalVariant of U+2B5B8 is U+9858 願.
 
 which is correct.
 
 ) said U+2B5B8 is the kSimplifiedVariant of U+9858 願, and U+613F 愿 is the
 kSemanticVariant, but 愿 is the simplified form of 願, not U+2B5B8.
 
 which I agree is not correct.  It's not always clear how asymmetrical
 cases like this should be handled.  For U+9918 餘, which is analagous,
 with a common simplified form U+4F59 余 and an alternate simplified
 form U+9980 馀, the Unihan database lists them both as simplified
 variants of U+9918:
 
 U+9918 kSimplifiedVariant U+4F59 U+9980
 
 On this precedent, I would expect:
 
 U+9858 kSimplifiedVariant U+613F U+2B5B8
 

Actually, it's a bit more complicated than that.  Note that the 
kSemanticVariant field for U+613F is actually U+9858<kFenn, which means that 
Fenn's _Five Thousand Dictionary_ lists the two as semantic variants.  (That 
should actually be U+9858<kFenn:T, since Fenn indicates they are complete 
synonyms.)  Fenn is a TC-only dictionary.  Note, too, that U+613F has both 
kCihaiT and kGSR fields, also indicating that it is used in TC.  

The HYDZD entry for U+613F first gives its old, TC definition (prudent, 
cautious—愿,謹也 per the Shuowen), then it adds that today it is used as the 
simplified form for 願.

So U+613F is a TC character in its own right meaning one thing, as well as the 
simplification/variant of another TC character meaning something else.  What we 
should have, therefore, is:

U+613F kDefinition (variant/simplification of U+9858 願) desire, want, wish; 
(archaic) prudent, cautious
U+613F kSemanticVariant U+9858<kFenn:T
U+613F kSpecializedSemanticVariant U+9858<kHanYu:T
U+613F kTraditionalVariant U+613F U+9858
U+613F kSimplifiedVariant U+613F
U+9858 kSimplifiedVariant U+613F U+2B5B8
U+9858 kSemanticVariant U+9613F<kFenn:T

Andrew, does that look like it covers everything correctly?  

 I suggest you report this issue on the Unicode Error Reporting form:
 
 http://www.unicode.org/reporting.html
 

Always sage advice, since you can't count on there being anybody reading this 
mailing list who can make the change.  When you do so, *please* include a 
source for your information.  We get all kinds of offered corrections to the 
Unihan data which we can't use because there's no authoritative source. 

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Unihan data for U+2B5B8 error

2011-10-20 Thread John H. Jenkins

On Oct 20, 2011, at 3:25 AM, Andrew West wrote:

 On 19 October 2011 18:41, John H. Jenkins jenk...@apple.com wrote:
 
 U+613F kDefinition (variant/simplification of U+9858 願) desire, want, wish; 
 (archaic) prudent, cautious
 U+613F kSemanticVariant U+9858<kFenn:T
 U+613F kSpecializedSemanticVariant U+9858<kHanYu:T
 U+613F kTraditionalVariant U+613F U+9858
 U+613F kSimplifiedVariant U+613F
 U+9858 kSimplifiedVariant U+613F U+2B5B8
 U+9858 kSemanticVariant U+9613F<kFenn:T
 
 Andrew, does that look like it covers everything correctly?
 
 Looks OK to me (except for the typo on the last line), although I
 wonder about the necessity for:
 
 U+613F kSimplifiedVariant U+613F
 
 Where a character can either traditionalify (what is the opposite of
 simplify?) to another character or stay the same then it is useful to
 have (e.g.):
 
 U+613F kTraditionalVariant U+613F U+9858
 
 But where a character does not change on simplification, is it not
 redundant to give it a kSimplifiedVariant mapping to itself ?  

Per the latest draft of UAX #38, if, when mapping from SC to TC, a character 
may change or may be left alone depending on context, it should be included 
among both its simplified and traditional variants.  And so…

 But there are other characters that fit this paradigm that do not have
 kSimplifiedVariant mappings to themself, such as:
 
 U+5E72 干
 
 But maybe that is a reflection of this line:
 
 U+5E72 kTraditionalVariant U+4E7E U+5E79
 
 which I think should be:
 
 U+5E72 kTraditionalVariant U+4E7E U+5E72 U+5E79
 


Yes, this should be fixed.  If you know of any others, please let me know.
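For anyone consuming these fields, the point of self-inclusion is that SC-to-TC conversion is one-to-many and context-dependent. A sketch of my own, using the corrected U+5E72 mapping discussed above:

```python
# kTraditionalVariant data where the character itself appears among its own
# traditional variants: the converter must offer all candidates and let
# context (or a dictionary) choose among them.
K_TRADITIONAL_VARIANT = {
    "\u5e72": ["\u4e7e", "\u5e72", "\u5e79"],  # 干 → 乾 / 干 / 幹
}

def tc_candidates(sc_char):
    return K_TRADITIONAL_VARIANT.get(sc_char, [sc_char])

print(tc_candidates("\u5e72"))  # ['乾', '干', '幹']
```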

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: missing characters: combining marks above runs of more than 2 base letters

2011-11-21 Thread John H. Jenkins

On Nov 21, 2011, at 3:37 AM, Michael Everson wrote:

 On 21 Nov 2011, at 07:23, Julian Bradfield wrote:
 
 Marking the (usually automatic) elisions is markup for elementary students.
 
 I can't think of any reason why this shouldn't be achievable in plain text. 
 Many encoded characters exist for paedagogical reasons.


Well, on a theoretical level, the issue is whether or not this is needed for 
minimal legibility, that is, whether or not the essential meaning of the text 
can be conveyed without it.  Personally, I don't think this is needed for 
minimal legibility, but that's a judgement call.

On a more pragmatic level, there's the issue of how many people would actually 
implement this, were it to become part of the standard.  This is of pretty 
marginal utility—we have, after all, managed to go for twenty years encoding 
Latin texts without it—and it would be very difficult to implement.  From a 
cost/benefit perspective, it's a pretty sure bet that virtually nobody would go 
to the trouble.

Now, granted, just because almost nobody would implement it, that doesn't mean 
that it shouldn't be part of the standard.  There's a lot in the standard 
already that is implemented but rarely, if at all. And granted, there are other 
portions of the standard which are similar enough to this that if you implement 
them, you may as well implement this, too.  Still, this strikes me as being of 
such very marginal utility that efforts to get it implemented as part of a 
plain-text standard seem pretty quixotic to me.  

(And before anybody accuses me of being overly cynical, I should point out that 
I'm probably the person putting in the greatest effort to get the Deseret 
Alphabet to be actually *used*.  How quixotic is *that*?)

=
井作恆
John H. Jenkins
jenk...@apple.com







Re: Upside Down Fu character

2012-01-03 Thread John H. Jenkins
There are really three choices:

1) Don't encode it at all and rely on higher-level protocols to display it.  
(After all, it's only used in specialized contexts and does not have a distinct 
meaning or pronunciation from the regular 福.)

2) Use a registered ideographic variation sequence to support it.  (This is 
really a variation of #1.)

3) Add it to UTR #45 and submit it to the IRG for inclusion in Extension F.  

My own feeling is that either #1 or #2 would be best, given its specialized 
nature.  
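For concreteness, option #2 in plain text would look like this (a sketch of my own; the selector shown is illustrative only, since an actual sequence would have to be registered in the Ideographic Variation Database first):

```python
# Base ideograph 福 (U+798F) followed by VS17 (U+E0100), the first of the
# variation selectors reserved for registered ideographic variation sequences.
fu = "\u798f"
ivs = fu + "\U000e0100"  # hypothetical, unregistered sequence
print(len(ivs), f"U+{ord(ivs[1]):05X}")  # 2 U+E0100
```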

 On Dec 30, 2011, at 8:34 AM, Andre Schappo wrote:

 The character 福 means happiness 
 http://www.mdbg.net/chindict/chindict.php?page=chardictcdcanoce=0cdqchi=福
 
 Unicode entry: U+798F  CJK UNIFIED IDEOGRAPH-798F
 
 It is customary to use an upside-down version of 福 during the Spring Festival 
 http://en.wikipedia.org/wiki/Fu_character
 
 I am considering proposing an upside-down version of 福 for inclusion in 
 Unicode. Not sure where it should go. Maybe - Enclosed Ideographic Supplement
 
 Thoughts?
 
 André 小山 Schappo
 ❀❀
 http://weibo.com/andreschappo
 http://blog.sina.com.cn/andreschappo
 http://twitter.com/andreschappo
 http://schappo.blogspot.com/
 http://me2day.net/andreschappo
 



Re: Upside Down Fu character

2012-01-12 Thread John H. Jenkins
Kang-Hao (Kenny) Lu 於 2012年1月12日 上午12:13 寫道:

 * Three folks think this is rather unnecessary (including me). Some
 people go further and say "What about a code point for XXX and YYY?"

Do they have specific XXXs and YYYs in mind?

In general, the process is outlined at 
http://www.unicode.org/pending/proposals.html.  For hanzi, the characters need 
to be added to UTR #45 first, but I'm going to propose that for both the 
upside-down fuk1—er, fu, and the upside-down chun, since they have been 
discussed.  UTR #45 lets us track such discussions.

=
井作恆
John H. Jenkins
jenk...@apple.com



Re: Upside Down Fu character

2012-01-13 Thread John H. Jenkins
Hanzi have a slightly different way of getting into the standard because it's 
all done through the IRG, which receives submissions from each member body. 
Submissions from the UTC start by being added to UTR #45.  That, however, is 
merely a database to track potential characters we're aware of. It doesn't mean 
that the UTC plans to request their encoding. Characters generally start out 
with a status of X, meaning that no decision has been made.

From everything I've seen so far, my own recommendation would be that the 
upside-down fu, at least, be given status W (meaning "inappropriate for 
encoding").  If anybody wants to advocate encoding it, they need to write a 
document and submit it to the UTC. They would need to either provide evidence 
of actual use as a text element in plain text (not as a graphic embedded in 
plain text—the emoji were a special case), or that it would be widely used as 
such (given a reasonable definition of "widely").  The UTC might well respond 
by asking for more information.  The current submission form is certainly a 
good template for providing information on a requested hanzi.

Assuming the UTC approves a status of N ("to be encoded"), the character would 
be included in the UTC's submission to the IRG for Extension F.  Work on 
Extension F will likely start in 2013.

Andre Schappo 於 2012年1月13日 上午8:36 寫道:

 
 On 12 Jan 2012, at 16:54, John H. Jenkins wrote:
 
 Kang-Hao (Kenny) Lu 於 2012年1月12日 上午12:13 寫道:
 
 * Three folks think this is rather unnecessary (including me). Some
 people go further and say "What about a code point for XXX and YYY?"
 
 Do they have specific XXXs and YYYs in mind?
 
 In general, the process is outlined at 
 http://www.unicode.org/pending/proposals.html.  For hanzi, the characters 
 need to be added to UTR #45 first, but I'm going to propose that for both 
 the upside-down fuk1—er, fu, and the upside-down chun, since they have been 
 discussed.  UTR #45 lets us track such discussions.
 
 =
 井作恆
 John H. Jenkins
 jenk...@apple.com
 
 
 I have received a request for an upside-down 钱 (=qián = money = U+94B1).
 
 I have talked with a small number of Chinese students about having an 
 upside-down fu character and they were all enthusiastic. I will be talking 
 with more Chinese students next week which is when the new term starts.
 
 John: As you are progressing upside-down fu and chun characters into UTR #45 
 does this mean that I no longer need to submit a Proposal Summary Form for 
 upside-down fu? I have not yet actually started on said form.
 
 André 小山 Schappo
 



Re: Upside Down Fu character

2012-01-13 Thread John H. Jenkins
Asmus Freytag 於 2012年1月13日 上午11:01 寫道:

 Nobody has written a formal proposal yet.
 
 When that is done, then one of the questions that needs to be decided in 
 initial triage is whether these are elements of the han script proper or 
 iconic symbols that happen to be derived from han characters. (The proposal 
 may suggest a particular resolution of this issue). If, with all facts on the 
 table, the consensus is that they are regular han characters, then their 
 further evaluation starts with tracking them under TR#45 and potentially 
 taking them to IRG for possible consideration in extension F.

It's been suggested that one way of handling them would be as encoded hanzi. 
That's one criterion for going into UTR #45 as it is part of the paper trail of 
the UTC's decision process.  

And the UTC could always refuse to put them into UTR #45.  My job is to make 
the recommendation.

In either case, somebody other than me (that is, somebody who wants them added 
to Unicode) needs to write a document/proposal to the UTC justifying that and 
giving the options for encoding.  

=
John H. Jenkins
井作恆
jenk...@apple.com



Re: Unihan database

2012-04-13 Thread John H. Jenkins
Yes, this is very much possible, although I can't predict how soon we'll get it 
done.

Martin Heijdra mheij...@princeton.edu 於 2012年4月13日 上午10:26 寫道:

 Librarians are certainly a group of users using Unihan a lot, to identify 
 encodings for rare characters.
  
 Several of them have complained that it gets more and more difficult for 
 them to use. One issue is that the database itself started to use encoded 
 characters rather than images, which made it impossible to see characters 
 that their standard SimSun fonts did not support. That of course had a 
 solution: they should now choose “use images”.
  
 But now they report that the radical-stroke page itself has changed to 
 encodings rather than images; and the radicals are not in the standard fonts. 
 Hence, the search pages (clicking on the number of strokes of the radical)  
 shows something like
  
 image001.png
  
 Can these pages (based upon the number of strokes of the radical), and not 
 only the results pages, also get a “display with images” option?
  
 Martin J. Heijdra
 Chinese Studies/East Asian Studies Bibliographer 
 East Asian Library and the Gest Collection 
 Frist Campus Center, Room 314 
 Princeton University 
 Princeton, NJ 08544 
 United States

=
John H. Jenkins
井作恆
jenk...@apple.com





Re: A new character to encode from the Onion? :)

2012-04-30 Thread John H. Jenkins

Asmus Freytag asm...@ix.netcom.com 於 2012年4月30日 下午1:59 寫道:

 On 4/30/2012 12:27 PM, Bill Poser wrote:
 
 Digital typography has reached The Onion: 
 http://www.theonion.com/articles/errant-keystroke-produces-character-never-before-s,28030/.
 
 Quote:
 
 , it is, in all likelihood, probably just another goddamn fertility 
 symbol.
 
 Make that: currency symbol and ship it.
 
 

Maybe a turtle ideograph?

=
井作恆
John H. Jenkins
jenk...@apple.com





Re: Plese add a Chinese Hanzi

2012-05-28 Thread John H. Jenkins

On 2012年5月28日, at 上午10:21, Charlie Ruland rul...@luckymail.com wrote:

 Zhao,
 1. If the character 鱼⿰丹 that you would like to have encoded is a contemporary 
 Standard Chinese word or morpheme, then what is its pronunciation?

FWIW, the correct syntax is ⿰鱼丹.  I take it that he would also like ⿰魚丹.

 2. Can you provide material (for example photos, scans from books, etc.) that 
 clearly shows that 鱼⿰丹 is used as a single character? By which group of 
 people is it used?

Exactly.  *No* hanzi will be added to Unicode/ISO 10646 without solid evidence 
of actual use. Generally, this means authoritative, printed materials (a 
dictionary, government ID). Handwritten materials could conceivably be used, 
but they would have to be awfully convincing. Well-known websites with the 
character embedded *as a graphic* have been used in the past, but in those 
cases the character was quite well-known.  
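
The operator-first point about ⿰鱼丹 versus 鱼⿰丹 can be checked mechanically. A rough sketch, using the arities of the original twelve Ideographic Description Characters (U+2FF0..U+2FFB; ⿲ and ⿳ take three operands, the rest two):

```python
# Sketch: IDCs such as ⿰ (U+2FF0) are prefix operators, so the operator
# precedes its operands: ⿰鱼丹, not 鱼⿰丹. A minimal well-formedness
# check by counting the operands still needed.
IDC_ARITY = {chr(cp): 3 if cp in (0x2FF2, 0x2FF3) else 2
             for cp in range(0x2FF0, 0x2FFC)}

def is_well_formed_ids(s):
    """True if s is a complete prefix-notation ideographic description sequence."""
    need = 1                                  # one expression expected
    for ch in s:
        if need == 0:
            return False                      # trailing characters
        need += IDC_ARITY.get(ch, 0) - 1      # operator adds operands; a leaf consumes one
    return need == 0

assert is_well_formed_ids("\u2FF0\u9C7C\u4E39")      # ⿰鱼丹 — well-formed
assert not is_well_formed_ids("\u9C7C\u2FF0\u4E39")  # 鱼⿰丹 — operator misplaced
```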


 Charlie
 
 * shi zhao shiz...@gmail.com [2012-05-28 17:07]:
 PS:
  zh-hans:  鱼+丹 
 zh-hant: 魚+丹
 
 
 2012/5/28 shi zhao shiz...@gmail.com
 Plese add a Hanzi to Unihan: a fish name 鱼+丹 = Danio.
 
 see:
  https://en.wikipedia.org/wiki/Danio
 https://zh.wikipedia.org/wiki/Category:%28%E9%AD%9A%E4%B8%B9%29%E5%B1%AC
 http://www.cnffd.com/index.php?route=product/categorypath=3_11_64_284
 http://zd1.brim.ac.cn/Mnamelist.asp?start=1982
 http://hello.area.com.tw/is_bs.cgi?areacode=nt097bsid=2.9.1.1.3
 https://www.google.com/search?q=Danio+ 魚丹
 
 
 Chinese wikipedia: http://zh.wikipedia.org/
 My blog: http://shizhao.org
 twitter: https://twitter.com/shizhao
 
 [[zh:User:Shizhao]]
 
 

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com




Re: Plese add a Chinese Hanzi

2012-05-30 Thread John H. Jenkins
Making a proposal directly to the IRG isn't possible under the present 
procedures.  What's usually done for this kind of thing is to have the UTC 
propose them.  

Andrew West andrewcw...@gmail.com 於 2012年5月30日 上午8:14 寫道:

 I personally think that rather than add characters such as this
 piecemeal, it would be more useful if someone or some organization
 could research what newly devised, unencoded characters are in use in
 biology, chemistry, etc., and make a proposal to encode them all,
 either via the Chinese national body or directly to IRG.  Characters
 used in modern scientific literature should be considered urgent use,
 in my opinion, and encoded sooner rather than later.
 

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: Flag tags

2012-05-31 Thread John H. Jenkins

Michael Everson ever...@evertype.com 於 2012年5月31日 上午11:57 寫道:

 When you encode a flag for Germany and the US, you automatically get a demand 
 for the encoding of a flag for Ireland and Iceland. That's the way it is. 

<tongue-in-cheek>
Oh, c'mon, Michael, next you'll be saying that because some countries have 
currency symbols with dedicated code points, other countries will make *new* 
currency symbols and demand that *they* get dedicated code points, too. We all 
know how unrealistic a scenario *that* is.
</tongue-in-cheek>

=
John H. Jenkins
jenk...@apple.com






Re: Offlist: complex rendering

2012-06-18 Thread John H. Jenkins

Naena Guru naenag...@gmail.com 於 2012年6月18日 下午3:50 寫道:

 Unicode says that it is all about codes and not shapes. It gave two examples, 
 Fraktur and Gaelic as scripts allowed to reside on Latin-1 but have shapes 
 not expected of Latin-1. That makes me wonder if Singhala is frowned upon 
 because it is not European. There is no other excuse because English was one 
 time romanized from fuþorc.


I'm going to regret this, but:

Unicode specifies semantics, not shapes.  The reason that drawing Latin 
characters with Sinhalese glyphs is incorrect is that they have different 
character semantics—that is, they behave differently.  It has nothing to do 
with Unicode failing to specify shapes.  

=
Siôn ap-Rhisiart
John H. Jenkins
jenk...@apple.com




Re: Unicode Core

2012-06-22 Thread John H. Jenkins

vanis...@boil.afraid.org 於 2012年6月22日 下午3:49 寫道:

 Wait a minute. Isn't 6.2 just adding the Turkish Lira? Does that really take 
 the chart people more than about 10 minutes?
 

The only *character* change is the Turkish lira.  There are numerous updates to 
UAXes and other parts of the documentation.  

=
Hoani H. Tinikini
John H. Jenkins
jenk...@apple.com






Re: texteditors that can process and save in different encodings

2012-10-04 Thread John H. Jenkins
BBEdit and TextWrangler on OS X both do a good job at handling different 
encodings.

On 2012年10月3日, at 下午10:58, Stephan Stiller stephan.stil...@gmail.com wrote:

 Dear all,
 
 In your experience, what are the best (plaintext) texteditors or word 
 processors for Linux / Mac OS X / Windows that have the ability to save in 
 many different encodings?
 
 This question is more specific than asking which editors have the best 
 knowledge of conversion tables for codepages (incl their different versions), 
 which I'm interested in as well. There are a number of programs that appear 
 to be able to read many different encodings – though I prefer the type that 
 actually tells me about where format errors are when a file is loaded. Then, 
 many editors that claim to be able to read all those encodings cannot display 
 them; as for that, I don't care about font choice and the aesthetics of 
 display, as I'm only interested in plaintext.
 
 Some things I have seen that are no good:
 the editor not telling me about the encoding and line breaks it has detected 
 and not letting me choose
 the editor displaying a BOM in hex mode even if there is none (a version of 
 UltraEdit I worked with at some point)
 
 Stephan
 



Re: xkcd: ‮LTR

2012-11-26 Thread John H. Jenkins
Or, if one prefers:

http://www.井作恆.net/XKCD/1137.html

On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote:

 
 http://xkcd.com/1137/ 
 
 Finally, an xkcd for Unicoders. :-)
 
 Debbie
 



Re: xkcd: LTR

2012-11-26 Thread John H. Jenkins
That's because the domain does, in fact, use sinograms and not Deseret.  (It's 
my Chinese name.)
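
For the curious, the fallback Philippe describes is just IDNA's ASCII-compatible encoding. A quick sketch of the mechanics with Python's standard library `idna` codec (used here purely for illustration):

```python
# Sketch: an IDN label travels through the DNS in Punycode (RFC 3492),
# prefixed with "xn--". A browser that cannot render the Unicode form
# falls back to showing this ASCII-compatible encoding.
label = "井作恆"
ace = label.encode("idna")          # ASCII-compatible encoding of the label
assert ace.startswith(b"xn--")
assert ace.decode("idna") == label  # round-trips back to the sinograms
```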

On 2012年11月26日, at 下午1:54, Philippe Verdy verd...@wanadoo.fr wrote:

 I wonder why this IDN link appears to me using sinograms in its domain name, 
 instead of Deseret letters. The link works, but my browser cannot display it 
 and its displays the Punycoded name instead without decoding it.
 
 This is strange because I do have Deseret fonts installed and I can view 
 Unicoded HTML pages containing Deseret letters.
 
 
 2012/11/26 John H. Jenkins jenk...@apple.com
 Or, if one prefers:
 
 http://www.井作恆.net/XKCD/1137.html
 
 On 2012年11月21日, at 上午10:22, Deborah Goldsmith golds...@apple.com wrote:
 
 
 http://xkcd.com/1137/ 
 
 Finally, an xkcd for Unicoders. :-)
 
 Debbie
 
 
 



Re: ‮LTR

2012-11-29 Thread John H. Jenkins
I double-checked *very* carefully, and I didn't see anything wrong at all.  :-)

You got sharp eyes there, Doug.

On 2012年11月28日, at 下午10:58, Doug Ewell d...@ewellic.org wrote:

 John H. Jenkins wrote:
 
 Or, if one prefers:
 
 http://www.井作恆.net/XKCD/1137.html
 
 In all the ensuing discussion about this page, did anyone notice the typo in 
 the Deseret cartoon?
 
 ☺
 
 --
 Doug Ewell | Thornton, Colorado, USA
 http://www.ewellic.org | @DougEwell
 





Re: I missed my self-imposed deadline for the Mayan numeral proposal

2012-12-21 Thread John H. Jenkins
http://xkcd.com/998/

On 2012年12月21日, at 下午4:22, Doug Ewell d...@ewellic.org wrote:

 And as you've no doubt heard to death by now, real Maya don't believe in that 
 apocalyptic mumbo-jumbo anyway. Today was a celebration.
 
 --
 Doug Ewell | Thornton, Colorado, USA
 http://www.ewellic.org | @DougEwell
 From: Julian Bradfield
 Sent: ‎12/‎21/‎2012 15:55
 To: unicode@unicode.org
 Subject: Re: I missed my self-imposed deadline for the Mayan numeral proposal
 
 On 2012-12-21, Clive Hohberger cp...@case.edu wrote:
  Don't worry, I think you now have another 5351 years until the next Mayan
  Doomsday...
 
 It's only 394 years till the next b'ak'tun.
 
 -- 
 The University of Edinburgh is a charitable body, registered in
 Scotland, with registration number SC005336.
 
 
 



Re: Ideograms (was: Spiral symbol)

2013-01-30 Thread John H. Jenkins

On 2013年1月30日, at 上午4:50, Andreas Stötzner a...@signographie.de wrote:

 Most ideographs in use are pictographs, for obvious reasons. But it would be 
 nice indeed to have ideograms for “thanks”,

謝

 “please”,

請

 “yes”,

對

 “no”,

不

 “perhaps”

許

 – all those common notions which cannot be de-*picted* in the true sense of 
 the word.
 


I'm not being entirely snarky here. The whole reason why the term "ideograph" 
got attached to Chinese characters in the first place is that they can convey 
the same meaning while representing different words in different languages. 
Chinese writing was one of the inspirations for Leibniz' Characteristica 
universalis, for example.  

Personally, I think that extensive reliance on ideographs for communication is 
a bad idea. Again, Chinese illustrates this. The grammars of Chinese and 
Japanese are so very different that although hanzi are perfectly adequate for 
the writing of a large number of Sinitic languages, they are completely 
inadequate for Japanese.  Ideographs are fine for some short, simple messages 
("The lady's room lieth behind yon door"), but not for actually expressing 
*language*.

And, in any event, if you *really* want non-pictographic ways of conveying 
abstract ideas, most of the work has been already done for you.




Re: Ideograms

2013-01-30 Thread John H. Jenkins
I happen to disagree slightly with DeFrancis on this point, BTW.  Sensu 
stricto, he is correct, but looking at it in such a limited way minimizes the 
cross-language utility of sinograms. 花 *means* "flower" whether it's Mandarin 
huā or Japanese ka or hana. Indeed, the fact that the same kanji 
can be used for both native Japanese words and Chinese loan-words illustrates 
my point.

DeFrancis' point is that you can't use hanzi for real communication other than 
the most basic (e.g., street signs). 花 means "flower" in China and Japan 
because it represents the Chinese morpheme for "flower" and the Japanese 
equivalent, not because it has any inherent meaning per se.  I feel that 
since many hanzi represent equivalent morphemes in several different languages, 
they can actually be said to have inherent meaning for all practical intents 
and purposes.

A Japanese reader can see the sentence 我有一只猫。 and come away with a general 
sense that it has something to do with a cat, but they can't *read* it any 
more than a Chinese speaker can truly read the sentence 私は猫を所有している。 OTOH, 
both Japanese and Chinese can find 日本 on a map without any trouble, since it 
means "day-root" in both languages.  (Actually, it means "Japan" in both 
languages, but it literally means "day-root", too, and I think that sounds more 
poetic.)

On 2013年1月30日, at 下午12:08, Charlie Ruland rul...@luckymail.com wrote:

 Yes, and on page 145 DeFrancis comes to the following conclusion:
 
 Chinese characters represent words (or better, morphemes), not ideas, and 
 they represent them phonetically, for the most part, as do all real writing 
 systems despite their diverse techniques and differing effectiveness in 
 accomplishing the task.
 
 The chapter these lines are from is also on-line: 
 http://www.pinyin.info/readings/texts/ideographic_myth.html .
 
 Charlie
 
 
 * Tim Greenwood timo...@greenwood.name [2013-01-30 20:17]:
 A very accessible book on all this is The Chinese Language: Fact and 
 Fantasy by John De Francis, published  in 1984 by University of Hawaii 
 Press. There is a brief synopsis on Wikipedia 
 http://en.wikipedia.org/wiki/The_Chinese_Language:_Fact_and_Fantasy
 
 - Tim
 
 
 
 On Wed, Jan 30, 2013 at 1:46 PM, John H. Jenkins jenk...@apple.com wrote:
 
 On 2013年1月30日, at 上午4:50, Andreas Stötzner a...@signographie.de wrote:
 
 Most ideographs in use are pictographs, for obvious reasons. But it would 
 be nice indeed to have ideograms for “thanks”,
 
 謝
 
 “please”,
 
 請
 
 “yes”,
 
 對
 
 “no”,
 
 不
 
 “perhaps”
 
 許
 
 – all those common notions which cannot be de-*picted* in the true sense of 
 the word.
 
 
 
 I'm not being entirely snarky here. The whole reason why the term 
 ideograph got attached to Chinese characters in the first 
 place is that they can convey the same meaning while representing different 
 words in different languages. Chinese writing was one of the inspirations 
 for Leibniz' Characteristica universalis, for example.  
 
 Personally, I think that extensive reliance on ideographs for communication 
 is a bad idea. Again, Chinese illustrates this. The grammars of Chinese and 
 Japanese are so very different that although hanzi are perfectly adequate 
 for the writing of a large number of Sinitic languages, they are completely 
 inadequate for Japanese.  Ideographs are fine for some short, simple messages 
 (The lady's room lieth behind yon door), but not for actually expressing 
 *language*.
 
 And, in any event, if you *really* want non-pictographic ways of conveying 
 abstract ideas, most of the work has been already done for you.
 
 
 



Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

2013-02-01 Thread John H. Jenkins

On 2013年2月1日, at 上午6:07, Costello, Roger L. coste...@mitre.org wrote:

 So why would one ever generate text in decomposed form (NFD)?
 

The Unihan database is stored in NFD because it makes the regular expressions 
used to validate its contents much, *much* simpler.  I imagine that things like 
fuzzy text matching are easier in NFD.  At worst, it's about as useful as 
UTF-32: occasionally very handy in internal processing, but not terribly 
attractive overall.
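
A small illustration of the regular-expression point, assuming Python's `re` and `unicodedata` (not how the Unihan tooling actually does it): in NFD, every acute accent is the single combining mark U+0301, so one short pattern finds all acute-accented letters, whereas in NFC each precomposed letter would need its own character-class entry.

```python
import re
import unicodedata

# Sketch: one pattern matches any base letter carrying an acute in NFD;
# in NFC you would have to enumerate á, é, í, ó, ú, ḱ, ... individually.
text = "hu\u0101 k\u00e1 m\u01d0ng"           # huā ká mǐng (precomposed)
nfd = unicodedata.normalize("NFD", text)
acute_bases = re.findall(r"(\w)\u0301", nfd)  # base letters before U+0301
assert acute_bases == ["a"]                   # only ká carries an acute
```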






Re: Encoding localizable sentences (was: RE: UTC Document Register Now Public)

2013-04-19 Thread John H. Jenkins

On 2013年4月19日, at 下午1:52, Stephan Stiller stephan.stil...@gmail.com wrote:

 But I'd argue that the distance of the information content of such 
 low-quality translations to the information content conveyed by correct and 
 polished language is often tolerable. Grammar isn't that important for 
 getting one's point across.

As my daughter says, "Talking is for to be understood, so if the meaning 
conveyed, the point happened."



Re: Standardized variation sequences for the Deseret alphabet?

2017-03-27 Thread John H. Jenkins

> On Mar 27, 2017, at 2:04 AM, James Kass <jameskass...@gmail.com> wrote:
> 
>> 
>> If we have any historic metal types, are there
>> examples where a font contains both ligature
>> variants?
> 
> Apparently not.
> 
> John H. Jenkins mentioned early in this thread that these ligatures
> weren't used in printed materials and were not part of the official
> Deseret set.  They were only used in manuscript.
> 

This is correct. Neither of the nineteenth century metal types included the 
letters in question. Nor were they included in any electronic fonts that I'm 
aware of before they were included in Unicode. 



Re: Standardized variation sequences for the Deseret alphabet?

2017-03-22 Thread John H. Jenkins
My own take on this is "absolutely not." This is a font issue, pure and simple. 
There is no dispute as to the identity of the characters in question, just 
their appearance. 

In any event, these two letters were never part of the "standard" Deseret 
Alphabet used in printed materials. To the extent they were used, it was in 
hand-written material only, where you're going to see a fair amount of 
variation anyway. There were also two recensions of the DA used in printed 
materials which are materially different, and those would best be handled via 
fonts.

It isn't unreasonable to suggest we change the glyphs we use in the Standard. 
Ken Beesley and I have have discussed the possibility, and we both feel that 
it's very much on the table.


Re: Standardized variation sequences for the Deseret alphabet?

2017-03-29 Thread John H. Jenkins

> On Mar 29, 2017, at 4:12 AM, Martin J. Dürst  wrote:
> 
> Let me start with a short summary of where I think we are at, and how we got 
> there.
> 
> - The discussion started out with two letters,
>  with two letter forms each. There is explicit talk of the
>  40-letter alphabet and glyphs in the Wikipedia page, not
>  of two different letters.
> - That suggests that IF this script is in current use, and the
>  shapes for these diphthongs are interchangeable (for those
>  who use the script day-to-day, not for meta-purposes such
>  as historic and typographic texts), keeping things unified
>  is preferable.
> - As far as we have heard (in the course of the discussion,
>  after questioning claims made without such information),
>  it seems that:
>  - There may not be enough information to understand how the
>creators and early users of the script saw this issue,
>on a scale that may range between "everybody knows these
>are the same, and nobody cares too much who uses which,
>even if individual people may have their preferences in
>their handwriting" to something like "these are different
>choices, and people wouldn't want their texts be changed
>in any way when published".

I see this part of the problem more one of proper transcription of existing 
materials, and less of one of what the original authors saw the issues as. 
Handwritten material is very important in the study of 19th century LDS 
history, and although the materials actually in the DA are scant (at best), the 
peculiarities of the spelling can be instructive. As such, I certainly agree 
that being able to transcribe material "faithfully" is important.

I'm not an expert in this area, though, so I can't speak for myself whether 
this separate encoding or variation selectors or some other mechanism is the 
best way to provide support for this. I'm more than happy to defer to Michael 
and other people who *are* experts. If paleographers think separate encoding is 
best, then I'm for separate encoding. 

>  - Similarly, there seem to be not enough modern practitioners
>of the script using the ligatures that could shed any
>light on the question asked in the previous item in a
>historical context, first apparently because there are not
>that many modern practitioners at all, and second because
>modern practitioners seem to prefer spelling with
>individual letters rather than using the ligatures.

Well, as one of the people in this camp, and as Michael has pointed out, I 
eschew use of these letters altogether. I restrict myself to the 1869 version 
of the alphabet, which is used in virtually all of the printed materials and 
has only thirty-eight letters. 

> - IF the above is true, then it may be that these ligatures
>  are mostly used for historic purposes only, in which case
>  it wouldn't do any harm to present-day users if they were separated.
> 
> If the above is roughly correct, then it's important that we reached that 
> conclusion after explicitly considering the potential of a split to create 
> inconvenience and confusion for modern practitioners, not after just looking 
> at the shapes only, coming up with separate historical derivations for each 
> of them, and deciding to split because history is way more important than 
> modern practice.


Fortunately, since the existing Deseret block is full, any separately encoded 
entities will have to be put somewhere else, making it easier to document the 
nature and purpose of the symbols involved. 

Not that we can be confident that it will help. 
(http://www.deseretalphabet.info/XKCD/1726.html)





Re: Standardized variation sequences for the Deseret alphabet?

2017-03-27 Thread John H. Jenkins

> On Mar 27, 2017, at 9:56 AM, John H. Jenkins <jenk...@apple.com> wrote:
> 
> 
>> On Mar 27, 2017, at 2:04 AM, James Kass <jameskass...@gmail.com 
>> <mailto:jameskass...@gmail.com>> wrote:
>> 
>>> 
>>> If we have any historic metal types, are there
>>> examples where a font contains both ligature
>>> variants?
>> 
>> Apparently not.
>> 
>> John H. Jenkins mentioned early in this thread that these ligatures
>> weren't used in printed materials and were not part of the official
>> Deseret set.  They were only used in manuscript.
>> 
> 
> This is correct. Neither of the nineteenth century metal types included the 
> letters in question. Nor were they included in any electronic fonts that I'm 
> aware of before they were included in Unicode. 
> 

This should teach me to double-check before posting. Apparently, the earlier 
typeface *did* include all forty letters; it just didn't use these two. I don't 
know what glyphs were used.



Re: Plane-2-only string

2017-11-13 Thread John H. Jenkins via Unicode
Ʃ ̥ ́ Ӽ Մ ݭ ݹ ந ன ோ ௦ ௽ ఋ ల ు ూ ృ ౓ ౘ ౥ ౷ ౸ ಜ ೏ ೕ ೖ ക ര േ ൈ ൉ ൩ ൯ ർ ൾ ൿ ග ට ඲ ฉ 

That is an example of forty Cantonese-specific characters which are not obscene 
(that I'm aware of) from Extension B. For the curious, I've appended at the 
bottom the full list of 280 for all of Plane 2 which I was able to pull out of 
the Unihan database. I'm sure some enterprising poet can make something out of 
them.
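
A rough sketch of how such a list can be pulled out of the Unihan data. The real Unihan_Readings.txt uses this tab-separated layout (code point, field, value); the three sample lines below stand in for the full file.

```python
# Sketch: filter Unihan readings data for plane-2 (U+20000..U+2FFFF)
# characters that carry a kCantonese reading.
sample = (
    "U+20C42\tkCantonese\tdat1\n"
    "U+4E00\tkCantonese\tjat1\n"
    "U+20325\tkCantonese\twu1 wu3\n"
)

plane2 = []
for line in sample.strip().splitlines():
    cp_hex, field, value = line.split("\t")
    cp = int(cp_hex[2:], 16)                  # "U+20C42" -> 0x20C42
    if field == "kCantonese" and 0x20000 <= cp <= 0x2FFFF:
        plane2.append((chr(cp), value))

assert len(plane2) == 2                       # U+4E00 is on the BMP, so it is skipped
```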

> On Nov 13, 2017, at 11:20 AM, Peter Constable via Unicode 
>  wrote:
> 
> I’m wondering if anyone could come up with a string of 15 to 40 characters 
> _using only plane 2 characters_ that wouldn’t be gibberish?
> 
> We are considering adding sample-text strings in some of our fonts. (In 
> OpenType, the ‘name’ table can take sample-text strings using name ID 19.) 
> One particular issue we have is the Simsun-ExtB and MingLiU-ExtB fonts, which 
> have CJK characters from plane 2 only.
> 
> Background:
> The Simsun-ExtB and MingLiU-ExtB fonts are meant to complement the Simsun and 
> MingLiU fonts: the combined glyph count exceeds the number of glyphs that can 
> be added in a single OpenType font, and so the “ExtB” fonts are used to 
> contain all of the Plane 2 characters that are supported. For example, the 
> Simsun font supports 28738 BMP characters, and no plane 2 characters, while 
> Simsun-ExtB supports the Basic Latin block from the BMP plus 47,293 plane 2 
> characters. The combined glyph count exceeds 64K, so can’t go into a single 
> font.
> 
> 
> 
> Peter
> 

U+201A9 faan2   (Cant.) to play
U+20325 wu1 wu3 (Cant.) to bow, stoop
U+20341 man3(Cant.) an undesirable situation
U+204FC sip3(Cant.) a wedge; to thrust in
U+20544 nap1(Cant.) 酒Մ, a dimple
U+2076D peng2   (Cant.) to fell, cut; to sweep away
U+20779 gaai3   (Cant.) to cut with a knife or scissors
U+20BA8 naai3   (Cant.) to tie, tow; bring along
U+20BA9 aa1 liu1(Cant.) an interjection; rare, specialized
U+20BCB jai4 jai5   (Cant.) naughty, inferior
U+20BE6 cai3(Cant.) to eat, take a meal
U+20BFD zi1 (Cant.) a final particle indicating affirmation
U+20C0B jaau1   (Cant.) left-handed
U+20C32 eot1(Cant.) to belch
U+20C41 tam3(Cant.) to fool, trick, cheat
U+20C42 dat1(Cant.) to put something or sit wherever one wishes; to 
rebuke, reproach
U+20C43 nip1(Cant.) thin, flat; poor
U+20C53 ngai1   (Cant.) to importune, beg
U+20C58 ngaak6  (Cant.) contrary, opposing, against; disobedient
U+20C65 fik1 jit6 we5   (Cant.) wrangling, a noise; fitful; a soft 
fabric with no body
U+20C77 ming1   (Cant.) small
U+20C78 san2 seon2  (Cant.) phonetic
U+20C9C zaang1  (Cant.) to owe
U+20CCF ce2 ce6 (Cant.) interjection
U+20CD5 caau3   (Cant.) to search
U+20CD6 dap6(Cant.) to strike, pound
U+20D15 miu2(Cant.) to purse the lips; to wriggle
U+20D30 gau6(Cant.) classifier for a piece or lump of something
U+20D47 keu4(Cant.) peculiar, strange
U+20D48 mui2(Cant.) to suck or chew without using the teeth
U+20D49 hong4   (Cant.) hope
U+20D69 go2 (Cant.) that
U+20D6F gwit1 gwit3 (Cant.) onomatopoetic
U+20D7C mang1 mang4 (Cant.) scars on the eyelid; phonetic
U+20D7E waak1   (Cant.) eloquent, sharp-tongued
U+20D7F pe1 pe5 (Cant.) a pair (from the Engl.); to stagger
U+20D9C zai3(Cant.) to do, work; to be willing
U+20DA7 dim6(Cant.) straight, vertical; OK; to pick up with the 
fingers; verbal aspect marker of successful completion
U+20DB2 gap6 kap6   (Cant.) to stare at; to take a big bite
U+20E09 kak1(Cant.) to block, obstruct
U+20E0A tap1(Cant.) an intensifying particle
U+20E0E naa1(Cant.) and, with
U+20E0F ge2 (Cant.) final particle
U+20E10 kam1(Cant.) to endure, last
U+20E11 soek3   (Cant.) soft, sodden
U+20E12 bou2(Cant.) 生ฒ人, a stranger
U+20E3A ngaak6  (Cant.) contrary, opposing
U+20E6D ko1 (Cant.) to call (Engl. loan-word)
U+20E73 git6(Cant.) thick, viscous, dense
U+20E77 ngo4(Cant.) to speak tirelessly
U+20E78 kam2(Cant.) to cover, close up
U+20E7A maai4   (Cant.) verbal aspect marker for completion or movement 
towards
U+20E7B zam6(Cant.) classifier for smells
U+20E8C gwe1(Cant.) timid
U+20E98 long1 long2 (Cant.) hard to get along with; to rinse, 
spread thin
U+20E9D gaak3   (Cant.) final particle
U+20EA2 gaa1 gaa2   (Cant.) final particle
U+20EAA he3 hi1 (Cant.) in a rush; slovenly
U+20EAB leu1(Cant.) strange, peculiar
U+20EAC he2 (Cant.) final particle
U+20ED7 le4 (Cant.) imperative final particle
U+20ED8  

Re: Emoji for major planets at least?

2018-01-18 Thread John H. Jenkins via Unicode
Well, you can go with Venus = white planet, Mercury = grey planet, Uranus = 
greenish planet, Neptune = bluish planet, Jupiter = striped planet.

As you say, though, without a context, none of them convey much and Venus, at 
least, would just be a circle. 

Plus there's the question of the context in which someone would want to send 
little pictures of the planets. This sounds like it would be adding emoji just 
because.

> On Jan 18, 2018, at 10:44 AM, Asmus Freytag via Unicode  
> wrote:
> 
> On 1/18/2018 6:55 AM, Shriramana Sharma via Unicode wrote:
>> Hello people.
>> 
>> We have sun, earth and moon emoji (3 for the earth and more for the
>> moon's phases). But we don't have emoji for the rest of the planets.
>> 
>> We have astrological symbols for all the planets and a few
>> non-existent imaginary "planets" as well.
>> 
>> Given this, would it be impractical to encode proper emoji characters
>> for the rest of the planets, at least the major ones whose physical
>> characteristics are well known and identifiable?
>> 
>> I mean for example identifying Sedna and Quaoar
>> (https://en.wikipedia.org/wiki/File:EightTNOs.png) is probably not
>> going to be practical for all those other than astronomy buffs but the
>> physical shapes of the major planets are known to all high school
>> students…
>> 
> Earth = blue planet (with clouds)
> 
> Mars = red planet
> 
> Saturn = planet with rings
> 
> I don't think any of the other ones are identifiable in a context-free 
> setting, unless you draw a "big planet with red dot" for Jupiter.
> 
> Earth would have to be depicted in a way that doesn't focus on "hemispheres", 
> or you miss the idea of it as "planet".
> 
> 
> 
> A./
> 
> 
> 



Re: Support for Extension F

2018-01-31 Thread John H. Jenkins via Unicode
macOS (and iOS, for that matter) fully support Extension F, provided fonts are 
available. I'm not aware of any work that Apple has done to its fonts for 
Extension F support. Indeed, I'm not aware of any publicly available fonts 
for Extension F, but I would gladly install one myself if one becomes available.
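
OS-level support is a separate question from font coverage: a character can be recognized by the system's Unicode data and still render as a missing glyph. A minimal Python sketch of the data side of that check (the range U+2CEB0..U+2EBE0 is the CJK Unified Ideographs Extension F block added in Unicode 10.0):

```python
import unicodedata

# CJK Unified Ideographs Extension F block (Unicode 10.0): U+2CEB0..U+2EBE0
EXT_F_FIRST, EXT_F_LAST = 0x2CEB0, 0x2EBE0

def in_extension_f(codepoint):
    """True if the codepoint lies in the Extension F block."""
    return EXT_F_FIRST <= codepoint <= EXT_F_LAST

# Unified ideographs get algorithmic names, so if this Python's bundled
# Unicode data is 10.0 or later, the first Ext F ideograph has a name:
print(unicodedata.name('\U0002CEB0'))
```

Font coverage itself would still need to be checked separately (e.g. by querying the font's cmap table), which this sketch does not attempt.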

> On Jan 29, 2018, at 10:26 PM, via Unicode  wrote:
> 
> 
> Dear All,
> 
> As many of you are aware getting characters encoded is only half the battle, 
> enabling people to use them is the other half.
> 
> CJK Extension F was added last year in Unicode 10. I have come across a number 
> of people saying they are having problems with Ext F. I was wondering what 
> the current support is for Ext F at OS level and in terms of fonts.
> 
> Regards
> John Knightley



Re: Memoji

2018-07-09 Thread John H. Jenkins via Unicode
Memoji are not merely animated emoji; they are personalized avatars. 

As for animated emoji, I expect that the UTC would consider them out-of-scope 
for plain text. Note that web pages can already contain animated or moving 
elements which cannot be represented in plain text. 

> On Jul 9, 2018, at 4:18 AM, William_J_G Overington via Unicode 
>  wrote:
> 
> I have seen the following video.
> 
> https://www.youtube.com/watch?v=CjqERCCD4iM
> 
> How will memoji be communicated from one device to another?
> 
> What happens if a message containing a memoji gets into a web page, such as 
> in the archives of this mailing list?
> 
> So, I am wondering whether memoji will become encoded into Unicode?
> 
> Will Unicode also have animation features?
> 
> This could be done with characters such as
> 
> ANIMATION START MARKER
> 
> ANIMATION FRAME SEPARATOR
> 
> ANIMATION FINISH MARKER
> 
> together with some more characters so as to specify frame duration 
> individually for each frame in milliseconds if other than a default 2000 
> milliseconds is wanted for a particular frame.
> 
> Could a message using memoji then be streamed using a plain text link?
> 
> William Overington
> 
> Monday 9 July 2018
> 



Re: Unicode 11.0 and 12.0 Cover Design Art

2018-03-13 Thread John H. Jenkins via Unicode
Maybe we should just throw in the towel and put "DON'T PANIC" on the cover in 
big, friendly letters. 




