RE: The result of the plane 14 tag characters review.
Kenneth Whistler wrote:

> Ahem... The Unicode Technical Committee would like to announce that no formal decision has been taken regarding the deprecation of Plane 14 language tag characters. The period for public review of this issue will be extended until February 14, 2003.

Out of curiosity, how did you leave the press room without passing through the riots? "Deprecate Plane 14 Now" militants have been fighting police for the whole afternoon, but no car with the Unicode flag was seen passing through.

_ Marco
Re: N2515: Request for Roadmap - plane 3
On Wed, 13 Nov 2002 02:03:27 -0800 (PST), John H. Jenkins wrote:

> Nope. We're still doing modern stuff.

Well, there's no rush, just as long as you get round to it sometime ... how about reserving a plane now anyway?

> All in all, I wouldn't be surprised if there were as many as ten thousand or so genuinely distinct characters in modern use which have yet to be encoded.

I'm really sceptical about this. Is there anywhere where I can see the proposals for CJK-C additions?

Andrew
Re: The result of the plane 14 tag characters review.
At 21:50 -0800 2002-11-12, Doug Ewell wrote:

> 3. Is there any method of tagging, anywhere, that is lighter-weight than Plane 14? (Corollary: Is lightweight important?)

HTML and XML markup?

-- Michael Everson * * Everson Typography * * http://www.evertype.com
Re: The result of the plane 14 tag characters review.
On 11/13/2002 05:40:53 AM Michael Everson wrote:

> At 21:50 -0800 2002-11-12, Doug Ewell wrote:
>
> > 3. Is there any method of tagging, anywhere, that is lighter-weight than Plane 14? (Corollary: Is lightweight important?)
>
> HTML and XML markup?

Doug was already comparing the plane 14 characters to HTML and XML, and clearly considers the latter to be relatively heavy -- and certainly they are heavier.

- Peter

---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]
Re: The result of the plane 14 tag characters review.
On 11/12/2002 11:50:51 PM Doug Ewell wrote:

> 1. What extra processing is necessary to interpret Plane 14 tags that wouldn't be necessary to interpret any other form of tags?

Obviously, extra processing is needed either way.

> 2. What extra processing is necessary to ignore Plane 14 tags that wouldn't be necessary to ignore any other Unicode character(s)?

None. And if some form of light-weight markup were used, then there would inevitably be a need for some kind of character escape mechanism, so ignoring language tagging would still entail interpreting the escapes. E.g.

#LT=en#This is English text, #LT=fr# et ce texte ci est en français. #LT=en#To use the pound character in text, as in He's in room ##4, you have to encode it twice.

> 3. Is there any method of tagging, anywhere, that is lighter-weight than Plane 14?

None that I can think of.

> Corollary: Is lightweight important?

Is this a corollary? It may be the crux of the issue. Tags using plane 14 characters may be the lightest mechanism around, but does anybody actually need to avoid markup that badly?

- Peter

---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]
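Peter's point about escapes can be made concrete with a small sketch. The `#LT=xx#` syntax and the `##` escape come only from his illustrative example; the function names `parse_lt` and `strip_lt` are hypothetical, and nothing here corresponds to a real standard. The sketch shows that even a reader who only wants to ignore the language switches still has to resolve the doubled pound signs.

```python
import re

# Hypothetical syntax taken from the example in the message:
# "#LT=xx#" switches the language, "##" stands for one literal "#".
_TOKEN = re.compile(r"##|#LT=([A-Za-z-]+)#")

def parse_lt(text, default_lang=None):
    """Split text into (language, run) pairs, resolving '##' escapes."""
    runs = []
    lang = default_lang
    buf = []
    pos = 0
    for match in _TOKEN.finditer(text):
        buf.append(text[pos:match.start()])
        pos = match.end()
        if match.group(0) == "##":
            buf.append("#")              # escaped pound sign
        else:
            run = "".join(buf)
            if run:
                runs.append((lang, run))  # close the run in the old language
            buf = []
            lang = match.group(1)         # switch to the new language
    buf.append(text[pos:])
    run = "".join(buf)
    if run:
        runs.append((lang, run))
    return runs

def strip_lt(text):
    """Ignore the language switches; escapes must still be interpreted."""
    return "".join(run for _, run in parse_lt(text))

sample = "#LT=en#This is English, #LT=fr#et ceci est en français. #LT=en#Room ##4."
print(parse_lt(sample))
print(strip_lt(sample))
```

Note that `strip_lt` cannot be a plain "delete these characters" filter: it has to go through the same tokenizer as the full parser, which is the cost Peter describes.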
RE: The result of the plane 14 tag characters review.
Hi.

> 3. Is there any method of tagging, anywhere, that is lighter-weight than Plane 14? (Corollary: Is lightweight important?)

> HTML and XML markup?

> Doug was already comparing the plane 14 characters to HTML and XML, and clearly considers the latter to be relatively heavy -- and certainly they are heavier.

Hm. <lang=en>...</lang>: that is 9+7 = 16 characters to indicate the language (and the end of the tag). All of them are ASCII, and therefore encoded as 1 byte each in UTF-8. Plane 14 requires 4 bytes per character in UTF-8, and at least 3 characters (two tag letters and the end tag): that is 12 bytes. OK, this is less heavy, but not by much. Or what do you think what weight in this context means?!?

Best regards.

-- Dominikus Scherkl [EMAIL PROTECTED]
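The byte counts in this message can be checked with a short sketch. The markup form `<lang=en>...</lang>` is an assumption reconstructed from the "9+7 = 16 characters" count, and the helper names `tag_letters` and `size` are hypothetical. The sketch also counts an explicit cancel tag and a UTF-16 encoding, neither of which the message considers.

```python
def tag_letters(ascii_text: str) -> str:
    """Map ASCII characters to their Plane 14 tag counterparts (U+E0000 + code)."""
    return "".join(chr(0xE0000 + ord(c)) for c in ascii_text)

# Assumed markup form: opening tag plus closing tag, all ASCII.
markup = "<lang=en>" + "</lang>"

# U+E0001 LANGUAGE TAG followed by the tag letters "e" and "n".
plane14_open = "\U000E0001" + tag_letters("en")
# U+E0001 U+E007F cancels the language tag (not always needed).
plane14_cancel = "\U000E0001\U000E007F"

def size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))

for enc in ("utf-8", "utf-16-le"):
    print(enc,
          "markup:", size(markup, enc),
          "tag only:", size(plane14_open, enc),
          "tag + cancel:", size(plane14_open + plane14_cancel, enc))
```

Under these assumptions the opening tag alone costs 12 bytes in UTF-8 against 16 for the markup pair, as the message says, but adding an explicit cancel tag brings the Plane 14 total to 20 bytes. In UTF-16 the markup pair grows to 32 bytes while the Plane 14 figures stay the same, so the comparison depends on both the encoding and whether a cancel tag is required.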
Re: The result of the plane 14 tag characters review.
Dominikus Scherkl scripsit:

> Or what do you think what weight in this context means?!?

I assumed it refers to protocol/parsing complexity. Stripping P14 tags is done without even a finite-state machine, whereas parsing XML requires a real parser.

--
Winter: MIT,
Keio, INRIA,
Issue lots of Drafts.
So much more to understand!
Might simplicity return?
(A tanka, or extended haiku)

John Cowan [EMAIL PROTECTED]
http://www.ccil.org/~cowan
http://www.reutershealth.com
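A minimal sketch of the stateless stripping John describes, assuming the tag characters occupy the range U+E0000..U+E007F; the function name `strip_plane14_tags` is hypothetical. Each code point is kept or dropped on its own, with no memory of what came before.

```python
def strip_plane14_tags(text: str) -> str:
    """Drop every code point in the range U+E0000..U+E007F.

    The decision for each character depends only on that character,
    so no parser state is carried from one character to the next.
    """
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)

# U+E0001 LANGUAGE TAG, tag "e", tag "n", then ordinary text.
tagged = "\U000E0001\U000E0065\U000E006EHello, world"
print(strip_plane14_tags(tagged))
```

By contrast, removing `#LT=en#`-style or XML-style markup means recognizing multi-character delimiters and escapes, which requires at least some state.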
Re: N2515: Request for Roadmap - plane 3
On Wednesday, November 13, 2002, at 03:22 AM, Andrew C. West wrote:

> On Wed, 13 Nov 2002 02:03:27 -0800 (PST), John H. Jenkins wrote:
>
> > Nope. We're still doing modern stuff.
>
> Well, there's no rush, just as long as you get round to it sometime ... how about reserving a plane now anyway?

Because there's no indication that we'll need a full plane, basically.

> > All in all, I wouldn't be surprised if there were as many as ten thousand or so genuinely distinct characters in modern use which have yet to be encoded.
>
> I'm really sceptical about this. Is there anywhere where I can see the proposals for CJK-C additions?

http://www.cse.cuhk.edu.hk/~irg/irg/extc/CJK_Ext_C.htm

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.tejat.net/
RE: The result of the plane 14 tag characters review.
Doug Ewell wrote:

> 1. What extra processing is necessary to interpret Plane 14 tags that wouldn't be necessary to interpret any other form of tags?

In order for the question to make sense, we should compare plain text with plain text and rich text with rich text.

1.a) Take plain text: however lightweight it may be to process (or strip) Plane 14 tags, it is anyway heavier than zero, which is the amount of processing that would be needed by Plane 14 tags if they did not exist, or which is needed if they are ignored.

1.b) Take rich text: the processing cost of plain-text is the sum of the processing costs of each piece of plain-text resulting from the interpretation of that rich-text protocol. Any additional cost is irrelevant to this comparison, because it only depends on the complexity of the higher protocol, and because it occurs *before* the plain-text fragments are available for processing. E.g., the extra processing needed to parse XML syntax (including XML language tagging) is not to be counted as plain-text processing.

> 2. What extra processing is necessary to ignore Plane 14 tags that wouldn't be necessary to ignore any other Unicode character(s)?

No extra processing would be necessary to ignore Plane 14 tags that wouldn't be necessary to ignore any other Unicode characters. But I fail to see the point of this question.

> 3. Is there any method of tagging, anywhere, that is lighter-weight than Plane 14? (Corollary: Is lightweight important?)

A lighter-weight method is not having language tagging at all in plain text. This is appropriate in two cases:

3.a) When you don't language tagging.

4.b) When language tagging can be provided by a higher level protocol.

My assumption is that plain text always falls in case (3.a), and rich text always falls in case (4.b). So far, I haven't seen any proof that this assumption is incorrect.

_ Marco
Re: The result of the plane 14 tag characters review.
Michael Everson <everson at evertype dot com> wrote:

> > 3. Is there any method of tagging, anywhere, that is lighter-weight than Plane 14? (Corollary: Is lightweight important?)
>
> HTML and XML markup?

and <Peter_Constable at sil dot org> replied:

> Doug was already comparing the plane 14 characters to HTML and XML, and clearly considers the latter to be relatively heavy -- and certainly they are heavier.

Certainly I don't want to claim, as some have, that HTML and XML and SGML are *very* heavy. But there is definitely a difference.

HTML language tags (used here to include the slightly more complex XML syntax as well) are of the form <lang="xx">, whereas Plane 14 tags are of the form ?xx where ? represents U+E0001 and xx, the language identifier, is translated to Plane 14. (HTML allows the alternative form <lang=xx> without quotation marks, but XML does not.) In either case, there is clearly more parsing to be done in the case of HTML:

* the spelling of the tag "lang" must be checked;
* alternatively, it might be another type of tag altogether (not a language tag);
* the equal sign = must be checked;
* there must be exactly 0 (HTML optional) or 2 quotation marks surrounding the identifier;
* the greater-than sign must be checked.

Plane 14 tags begin with a single, dedicated code point that means "language tag," so no syntax checking is needed at that point. The language identifier itself is encoded by dedicated code points, so checking for the end of the tag is simpler (last character in the tag range, or end of stream).

Parsing the cancel tag is likewise simpler: </lang> vs. U+E0001 U+E007F. For that matter, a Plane 14 cancel tag is not always necessary, which is not true in HTML.

Any syntax checking of the identifier itself (e.g. "en" is valid but "em" is not) must be performed regardless of the mechanism, so neither approach holds an advantage there.

Peter continued:

> > 2. What extra processing is necessary to ignore Plane 14 tags that wouldn't be necessary to ignore any other Unicode character(s)?
>
> None. And if some form of light-weight markup were used, then there would inevitably be a need for some kind of character escape mechanism, so ignoring language tagging would still entail interpreting of the escapes. E.g.
>
> #LT=en#This is English text, #LT=fr# et ce texte ci est en français. #LT=en#To use the pound character in text, as in He's in room ##4, you have to encode it twice.

Exactly. With the dedicated code points in Plane 14, you don't need either the closing tag or the double-# escaping scheme.

I am not arguing that it takes Herculean effort to program a parser for ASCII-based language tags, only that Plane 14 tags are even simpler, and that some text applications call for the mechanism of greater simplicity.

-Doug Ewell
Fullerton, California
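The scanning rules Doug describes can be sketched as follows. This is an illustration under stated assumptions, not a conformant implementation: the names `tag` and `split_language_runs` are hypothetical, the tag letters are taken to be U+E0020..U+E007E (ASCII plus 0xE0000), and a bare U+E007F is treated as cancelling the current tag, which goes beyond what the message states.

```python
LANGUAGE_TAG = 0xE0001               # introduces a language tag
CANCEL_TAG = 0xE007F                 # U+E0001 U+E007F cancels the language tag
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E

def tag(lang: str) -> str:
    """Build a Plane 14 language tag for an ASCII identifier such as 'en'."""
    return chr(LANGUAGE_TAG) + "".join(chr(0xE0000 + ord(c)) for c in lang)

def split_language_runs(text: str):
    """Return a list of (language or None, run) pairs."""
    runs, buf, lang = [], [], None
    i, n = 0, len(text)

    def flush():
        # Close the pending run under the language currently in force.
        if buf:
            runs.append((lang, "".join(buf)))
            buf.clear()

    while i < n:
        cp = ord(text[i])
        if cp == LANGUAGE_TAG:
            flush()
            i += 1
            if i < n and ord(text[i]) == CANCEL_TAG:
                lang = None
                i += 1
            else:
                start = i
                # The tag ends at the first character outside the tag range.
                while i < n and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
                    i += 1
                lang = "".join(chr(ord(c) - 0xE0000) for c in text[start:i]) or None
        elif cp == CANCEL_TAG:
            flush()          # assumption: a bare cancel also ends the tag
            lang = None
            i += 1
        else:
            buf.append(text[i])
            i += 1
    flush()
    return runs

sample = tag("en") + "Hello " + tag("fr") + "bonjour" + chr(LANGUAGE_TAG) + chr(CANCEL_TAG) + "!"
print(split_language_runs(sample))
```

There is no delimiter spelling to verify and no escape to resolve, because the tag characters cannot occur as ordinary text. Validating the identifier itself is left out, since, as the message notes, that check is the same under any mechanism.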
RE: The result of the plane 14 tag characters review.
I wrote:

> [...] A lighter-weight method is not having language tagging at all in plain text. This is appropriate in two cases: 3.a) When you don't language tagging. [...]

Sorry, I meant: "When you don't need language tagging."

_ Marco
Re: The result of the plane 14 tag characters review.
Marco Cimarosti <marco dot cimarosti at essetre dot it> wrote:

> > 2. What extra processing is necessary to ignore Plane 14 tags that wouldn't be necessary to ignore any other Unicode character(s)?
>
> No extra processing would be necessary to ignore Plane 14 tags that wouldn't be necessary to ignore any other Unicode characters. But I fail to see the point of this question.

The point is to refute the argument that Plane 14 tags cause extra work for the vast majority of applications that choose to ignore them. If they are Unicode-conformant, they already have to ignore characters that they don't understand. Ignoring Plane 14 tags is as easy as ignoring Cherokee.

> > 3. Is there any method of tagging, anywhere, that is lighter-weight than Plane 14? (Corollary: Is lightweight important?)
>
> A lighter-weight method is not having language tagging at all in plain text. This is appropriate in two cases: 3.a) When you don't [need] language tagging.

Then don't use it. I have never suggested that all Unicode text must be language-tagged.

> 4.b) When language tagging can be provided by a higher level protocol.

Then use the tagging mechanism provided by the higher-level protocol instead, IF you were going to use the higher-level protocol anyway.

There are lots of cases where HTML functionality duplicates, and overrides, plain-text functionality; see UTR #20 for numerous examples. As I mentioned in my paper, even the venerable CR/LF is overridden by HTML <p> and <br> -- and this is fine. There is no need to deprecate CR and LF because of this, or to prohibit them in HTML files. The same should be true of Plane 14.

-Doug Ewell
Fullerton, California
Re: The result of the plane 14 tag characters review.
On Wed, Nov 13, 2002 at 08:25:21AM -0600, [EMAIL PROTECTED] wrote:

> Is this a corollary? It may be the crux of the issue. Tags using plane 14 characters may be the lightest mechanism around, but does anybody actually need to avoid markup that badly?

GNU Libc used them to round-trip ISO-2022-JP-3 last time I checked.

--
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie.
-- Stephen Crane, War is Kind
RE: The result of the plane 14 tag characters review.
On 11/13/2002 09:03:26 AM Dominikus Scherkl wrote:

> Ok, this is less heavy, but not very much. Or what do you think what weight in this context means?!?

There is weight in terms of bandwidth, but also in terms of mechanisms needed to interpret markup. It takes a lot more to handle HTML or XML than it does simply language tagging using plane 14 characters -- or some simpler form of markup than HTML / XML.

I'm not trying to argue specifically in favour of the plane 14 characters, BTW.

- Peter

---
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]
Double Byte Character Set (DBCS)
Paul,

I am forwarding your inquiry to the Unicode list. I hope someone on the list will be able to address your question.

Regards,
Magda Danish
Administrative Director
The Unicode Consortium
650-693-3921

-Original Message-
Date/Time: Wed Nov 13 09:32:39 EST 2002
Contact: [EMAIL PROTECTED]
Report Type: Other Question, Problem, or Feedback

Dear Sir/Madam,

I am writing on behalf of a global software company called Rebus iS. Rebus iS has developed an underwriting package that has been very successful in European markets. The application runs on an AS400 device but was developed primarily for English speaking countries. We are now looking to expand the market for this product into countries such as China. To achieve this I have been informed we need to enable our application for Double Byte Character Set (DBCS). This confuses me when I read of UNICODE. If the AS400 supports UNICODE, and assuming DBCS and UNICODE are mutually exclusive, would it not make more sense to enable for UNICODE only from the outset? I would be grateful for your assistance in this matter.

Kind Regards
Paul Downey
Rebus Insurance Systems Limited
Registered No. 508212 England
Registered Office Suffolk House, 102 - 108 Baxter Avenue, Southend-on-Sea, Essex SS2 6JP United Kingdom
Tel: +44 (0) 1702 236691
Fax: +44 (0) 1702 353276
http://www.rebusis.com

(End of Report)
RE: The result of the plane 14 tag characters review.
I think Doug asked for lightweight. HTML and XML markup aren't lightweight by any means, although a special purpose plain-text oriented XML (LTML for language-tagged markup language) might not be that much more involved than plane 14 tags. It would also have the advantage that standard XSLT tools could be used to translate between LTML and XHTML, etc.

Murray

Michael Everson wrote:

> At 21:50 -0800 2002-11-12, Doug Ewell wrote:
>
> > 3. Is there any method of tagging, anywhere, that is lighter-weight than Plane 14? (Corollary: Is lightweight important?)
>
> HTML and XML markup?
IBM AIX 5 and GB18030
Dear I18N experts,

I have searched all the web on IBM about the support of GB18030 in OS AIX 4.3 and 5, but didn't find anything. I can only see that they support GB2312 and GBK. I know IBM was one of the pioneers to support GB18030, i.e. their ICU. But it doesn't make sense that their AIX doesn't support it? Please shed some light!

Thanks!
Jane
Re: Double Byte Character Set (DBCS)
-Original Message-

> We are now looking to expand the market for this product into countries such as China. To achieve this I have been informed we need to enable our application for Double Byte Character Set (DBCS).

DBCS is an old, pre-Unicode term for character sets with Chinese/Japanese/Korean characters.

> This confuses me when I read of UNICODE. If the AS400 supports UNICODE, and assuming DBCS and UNICODE are mutually exclusive, would it not make more sense to enable for UNICODE only from the outset?

Yes, exactly. In that sense, Unicode support is a superset of DBCS enablement. A Unicode-based application may have to convert between various charsets and Unicode on the edges, but all the support you need is provided internally by processing Unicode.

Best regards,
markus

-- Opinions expressed here may not reflect my company's positions unless otherwise noted.
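The "convert on the edges" pattern markus describes might look like the sketch below. It uses Python's built-in `gbk` and `gb18030` codecs as stand-ins for whatever legacy double-byte charsets a real deployment would meet; the AS/400's own code pages are not modelled, and the helper names `to_unicode` and `from_unicode` are hypothetical.

```python
def to_unicode(raw: bytes, charset: str) -> str:
    """Inbound edge: decode legacy bytes into Unicode text."""
    return raw.decode(charset)

def from_unicode(text: str, charset: str) -> bytes:
    """Outbound edge: encode Unicode text for a legacy consumer."""
    return text.encode(charset)

# Bytes arriving from a system that speaks a double-byte charset.
incoming = "保险".encode("gbk")          # "insurance", 2 bytes per character

# Everything inside the application works on Unicode strings only.
text = to_unicode(incoming, "gbk")
label = text + " (underwriting)"

# The outbound charset can differ from the inbound one.
outgoing = from_unicode(label, "gb18030")

print(len(incoming), len(text), label)
```

The application logic in the middle never needs to know which charset the data arrived in, which is the sense in which Unicode enablement subsumes DBCS enablement.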
Re: IBM AIX 5 and GB18030
xjliu_ca wrote:

> I have searched all the web on IBM about the support of GB18030 in OS AIX 4.3 and 5, but didn't find anything. I only can see they support GB2312 and GBK.

Google found something for me: http://www-3.ibm.com/software/ts/mqseries/support/readme/aix530_read.html

Search for "18030" on this page. Quote:

"On AIX V5.1, APAR IY26937 provides support for conversion between GB18030 (CCSID 5488) and Unicode. Support is NOT provided for the conversion between GB18030 and 1388 (EBCDIC). Conversion between these CCSIDs can cause unpredictable results."

> I know IBM was one of the pioneers to support GB18030, i.e. their ICU.

:-)

markus

-- Opinions expressed here may not reflect my company's positions unless otherwise noted.
Re: The result of the plane 14 tag characters review.
I have been watching this thread for some time now, and Doug Ewell's comments have prompted me to add my two cents' worth.

In an effort to unify all characters and pictographs, the decision was made to unify CJK characters by suppressing most variant forms. That turns out to be the single greatest objection from users -- especially Japanese -- and somehow we need a low-level way of indicating the target language in the context of multilingual text. The plane 14 tags seem to be appropriate to do this, giving a hint to the font engine as to a good choice of alternate glyphs, where available.

The problems occur first, because the code scanner can no longer be stateless; second, because one needs to provide an over-ride to higher-level layout engines; third, because it can't solve problems where multiple glyphs exist, whose use is highly context-dependent, as is the case for some Japanese texts; and fourth, because there is no one-to-one translation between the (largely) non-unified simplified and traditional characters in Chinese.

It seems to me that the Unicode people should bite the bullet and accept that where the unification process creates problems, a solution needs to be provided. The use of the language tags should be able to deal with most objections to rendering in a given language, _provided_ direction is given as to how the use of plane 14 tags should behave (I say, as a hint for glyph choice), and how the rendering engine should communicate with higher-order text processing.

Note that I am _not_ advocating the use of such tags to describe font _styles_ although when dealing with long s, for instance, the boundary is fuzzy.

To suggest that such fundamental glyph choices as linguistic preference should be left to high-level markup in text-processing applications, without providing a unified way to do it, seems to violate the spirit of Unicode.
George

> Kenneth Whistler <kenw at sybase dot com> wrote:
>
> > The Unicode Technical Committee would like to announce that no formal decision has been taken regarding the deprecation of Plane 14 language tag characters. The period for public review of this issue will be extended until February 14, 2003.
>
> Gee, a press conference after all. Too bad my TV was turned off.
>
> No, seriously, thanks for the update. I'm glad to see the matter was considered worthy of further study. Hopefully other people who have an opinion on Plane 14 will contribute to the public review.
>
> Ken also wrote:
>
> > Doug's contribution would be more convincing if it dropped away the irrelevancies about whether the *function* of language tagging is useful and focussed completely on the appropriateness of this *particular* set of characters on Plane 14 as opposed to any other means of conveying the same distinctions.
>
> That's why I included a "severability clause," to the effect that if one of my arguments was bogus (or irrelevant) it shouldn't affect the credibility of the others.
>
> To answer the question "why Plane 14 plain-text instead of markup," I suppose I need to make the case that this meta-information is sometimes appropriate in short strings and labels where rich text is overkill. This was basically the argument put forth by the ACAP people. I did some homework on the MLSF proposal (a little late, I know) and saw that their primary perceived need was for tagging short strings in protocols which did not lend themselves to an additional rich-text layer.
>
> After seeing the MLSF tagging scheme, I agree more than ever that its deployment would have jeopardized the usefulness of UTF-8. Although the number of proposals like this to "extend" or "enhance" UTF-8 has diminished greatly since then, it would be a shame to see them resurface on the basis that "Unicode doesn't provide us any alternative."
> To me, the most difficult part of the "Save Plane 14" campaign seems to be convincing people that not every text problem lends itself to a markup solution. Without questioning the current and future importance of HTML and XML, there *is* text in the world that is not wrapped in one of these formats, and cannot be reasonably converted to them, yet still needs to be processed in some way.
>
> Judging from the discussion on the list last week, there also seems to be a perception that Plane 14 tags require a great deal of overhead, even to ignore them. I'd like to continue that discussion (especially since the public-review period has been extended) and ask:
>
> 1. What extra processing is necessary to interpret Plane 14 tags that wouldn't be necessary to interpret any other form of tags?
>
> 2. What extra processing is necessary to ignore Plane 14 tags that wouldn't be necessary to ignore any other Unicode character(s)?
>
> 3. Is there any method of tagging, anywhere, that is lighter-weight than Plane 14? (Corollary: Is lightweight important?)
>
> -Doug Ewell
> Fullerton, California

--
Dr George W Gerrity     Phone: +61 2 6386 3431
GWG Associates          Fax: +61 2 6386 3431
P O Box 229