RE: The result of the plane 14 tag characters review.

2002-11-13 Thread Marco Cimarosti
Kenneth Whistler wrote:
 Ahem...
 
 The Unicode Technical Committee would like to announce that no
 formal decision has been taken regarding the deprecation of
 Plane 14 language tag characters. The period for public review of
 this issue will be extended until February 14, 2003.

Out of curiosity, how did you leave the press room without passing through
the riots? "Deprecate Plane 14 Now" militants have been fighting police for
the whole afternoon, but no car with the Unicode flag was seen passing
through.

_ Marco




Re: N2515: Request for Roadmap - plane 3

2002-11-13 Thread Andrew C. West
On Wed, 13 Nov 2002 02:03:27 -0800 (PST), John H. Jenkins wrote:

 Nope.  We're still doing modern stuff.
 

Well, there's no rush, just as long as you get round to it sometime ... how
about reserving a plane now anyway ?

 All in all, I wouldn't be surprised if there were as many as ten 
 thousand or so genuinely distinct characters in modern use which have 
 yet to be encoded.

I'm really sceptical about this. Is there anywhere where I can see the proposals
for CJK-C additions ?

Andrew




Re: The result of the plane 14 tag characters review.

2002-11-13 Thread Michael Everson
At 21:50 -0800 2002-11-12, Doug Ewell wrote:


3.  Is there any method of tagging, anywhere, that is lighter-weight
than Plane 14?  (Corollary: Is lightweight important?)


HTML and XML markup?
--
Michael Everson * * Everson Typography *  * http://www.evertype.com




Re: The result of the plane 14 tag characters review.

2002-11-13 Thread Peter_Constable

On 11/13/2002 05:40:53 AM Michael Everson wrote:

At 21:50 -0800 2002-11-12, Doug Ewell wrote:

3.  Is there any method of tagging, anywhere, that is lighter-weight
than Plane 14?  (Corollary: Is lightweight important?)

HTML and XML markup?

Doug was already comparing the plane 14 characters to HTML and XML, and
clearly considers the latter to be relatively heavy -- and certainly they
are heavier.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]







Re: The result of the plane 14 tag characters review.

2002-11-13 Thread Peter_Constable

On 11/12/2002 11:50:51 PM Doug Ewell wrote:

1.  What extra processing is necessary to interpret Plane 14 tags that
wouldn't be necessary to interpret any other form of tags?

Obviously, extra processing is needed either way.



2.  What extra processing is necessary to ignore Plane 14 tags that
wouldn't be necessary to ignore any other Unicode character(s)?

None. And if some form of light-weight markup were used, then there would
inevitably be a need for some kind of character escape mechanism, so
ignoring language tagging would still entail interpreting of the escapes.
E.g.

#LT=en#This is English text, #LT=fr# et ce texte ci est en français.
#LT=en#To use the pound character in text, as in He's in room ##4, you
have to encode it twice.
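
To make the cost concrete, here is a minimal sketch of what "ignoring" such markup involves. The #LT=xx# syntax and the ## escape are only the hypothetical scheme from the example above, and the function name is made up for illustration:

```python
import re

# Hypothetical scheme from the example above: "#LT=xx#" opens a language
# tag, and a literal pound sign in running text is written as "##".
_TOKEN = re.compile(r"##|#LT=([A-Za-z-]+)#")

def strip_language_tags(text: str) -> str:
    """Drop #LT=xx# tags and turn each escaped '##' back into '#'."""
    def replace(match):
        return "#" if match.group(0) == "##" else ""
    return _TOKEN.sub(replace, text)

sample = "#LT=en#He's in room ##4, #LT=fr#et ce texte est en français."
print(strip_language_tags(sample))
# He's in room #4, et ce texte est en français.
```

Even a reader that does not care about language has to run the unescaping step, which is the point being made here.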



3.  Is there any method of tagging, anywhere, that is lighter-weight
than Plane 14?

None that I can think of.


Corollary: Is lightweight important?

Is this a corollary? It may be the crux of the issue. Tags using plane 14
characters may be the lightest mechanism around, but does anybody actually
need to avoid markup that badly?



- Peter














RE: The result of the plane 14 tag characters review.

2002-11-13 Thread Dominikus Scherkl
Hi.

   3.  Is there any method of tagging, anywhere, that is lighter-weight
   than Plane 14?  (Corollary: Is lightweight important?)
  
  HTML and XML markup?
 
 Doug was already comparing the plane 14 characters to HTML and XML, and
 clearly considers the latter to be relatively heavy -- and certainly they
 are heavier.

Hm. <lang=en>...<\lang>
that are 9+7 = 16 characters to indicate the language (and end of tag)
All of them are ASCII, therefore encoded as 1 byte utf-8 each.
Plane 14 requires 4 byte utf-8 each, and at least 3 characters
(two tag-letters and the end-tag) - this is 12 bytes.
Ok, this is less heavy, but not very much.
Or what do you think what weight in this context means?!?
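
The byte counts above can be checked with a short sketch. The markup spelling (<lang=en> to open, <\lang> to close) is assumed from the 9+7 character count, and the three Plane 14 characters are chosen to match "two tag-letters and the end-tag":

```python
# Markup form assumed here: "<lang=en>" to open and "<\lang>" to close,
# which gives the 9 + 7 characters counted above.
markup = "<lang=en>" + "<\\lang>"

# Two tag letters (TAG LATIN SMALL LETTER E and N) plus CANCEL TAG,
# all taken from the Plane 14 tag block U+E0000..U+E007F.
plane14 = "\U000E0065\U000E006E\U000E007F"

print(len(markup.encode("utf-8")))   # 16
print(len(plane14.encode("utf-8")))  # 12
```

Every code point at or above U+10000 takes 4 bytes in UTF-8, so the Plane 14 figure grows by 4 bytes for each further tag character.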

Best regards.
-- 
Dominikus Scherkl
[EMAIL PROTECTED]




Re: The result of the plane 14 tag characters review.

2002-11-13 Thread John Cowan
Dominikus Scherkl scripsit:

 Or what do you think what weight in this context means?!?

I assumed it refers to protocol/parsing complexity.  Stripping P14 tags
is done without even a finite-state machine, whereas parsing XML requires
a real parser.
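
A minimal sketch of what I mean, assuming the tag block is U+E0000..U+E007F; the function name is only illustrative. Each character is judged on its own, with no state carried from one to the next:

```python
def strip_plane14_tags(text: str) -> str:
    """Remove every code point in the tag block U+E0000..U+E007F."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)

# U+E0001 LANGUAGE TAG, then tag letters "e" and "n", then ordinary text.
tagged = "\U000E0001\U000E0065\U000E006EHello"
print(strip_plane14_tags(tagged))  # Hello
```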

-- 
Winter:  MIT,   John Cowan
Keio, INRIA,[EMAIL PROTECTED]
Issue lots of Drafts.   http://www.ccil.org/~cowan
So much more to understand! http://www.reutershealth.com
Might simplicity return?(A tanka, or extended haiku)




Re: N2515: Request for Roadmap - plane 3

2002-11-13 Thread John H. Jenkins
On Wednesday, November 13, 2002, at 03:22 AM, Andrew C. West wrote:


On Wed, 13 Nov 2002 02:03:27 -0800 (PST), John H. Jenkins wrote:


Nope.  We're still doing modern stuff.



Well, there's no rush, just as long as you get round to it sometime ... how
about reserving a plane now anyway ?


Because there's no indication that we'll need a full plane, basically.


All in all, I wouldn't be surprised if there were as many as ten
thousand or so genuinely distinct characters in modern use which have
yet to be encoded.


I'm really sceptical about this. Is there anywhere where I can see the proposals
for CJK-C additions ?


http://www.cse.cuhk.edu.hk/~irg/irg/extc/CJK_Ext_C.htm


==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.tejat.net/





RE: The result of the plane 14 tag characters review.

2002-11-13 Thread Marco Cimarosti
Doug Ewell wrote:
 1.  What extra processing is necessary to interpret Plane 14 tags that
 wouldn't be necessary to interpret any other form of tags?

In order for the question to make sense, we should compare plain text with
plain text and rich text with rich text.

1.a) Take plain text: however lightweight it may be to process (or strip)
Plane 14 tags, it is anyway heavier than zero, which is the amount of
processing that would be needed by Plane 14 tags if they did not exist, or
which is needed if they are ignored.

1.b) Take rich text: the processing cost of plain-text is the sum of the
processing costs of each piece of plain-text resulting from the
interpretation of that rich-text protocol. Any additional cost is irrelevant
to this comparison, because it only depends on the complexity of the higher
protocol, and because it occurs *before* the plain-text fragments are
available for processing. E.g., the extra processing needed to parse XML
syntax (including XML language tagging) is not to be counted as plain-text
processing.

 2.  What extra processing is necessary to ignore Plane 14 tags that
 wouldn't be necessary to ignore any other Unicode character(s)?

No extra processing would be necessary to ignore Plane 14 tags that wouldn't
be necessary to ignore any other Unicode characters. But I fail to see the
point of this question.

 3.  Is there any method of tagging, anywhere, that is lighter-weight
 than Plane 14?  (Corollary: Is lightweight important?)

A lighter-weight method is not having language tagging at all in plain text.
This is appropriate in two cases:

3.a) When you don't language tagging.

4.b) When language tagging can be provided by a higher level protocol.

My assumption is that plain text always falls in case (3.a), and rich text
always falls in case (4.b). So far, I haven't seen any proof that this
assumption is incorrect.

_ Marco




Re: The result of the plane 14 tag characters review.

2002-11-13 Thread Doug Ewell
Michael Everson everson at evertype dot com wrote:

 3.  Is there any method of tagging, anywhere, that is lighter-weight
 than Plane 14?  (Corollary: Is lightweight important?)

 HTML and XML markup?

and Peter_Constable at sil dot org replied:

 Doug was already comparing the plane 14 characters to HTML and XML,
 and clearly considers the latter to be relatively heavy -- and
 certainly they are heavier.

Certainly I don't want to claim, as some have, that HTML and XML and
SGML are *very* heavy.  But there is definitely a difference.

HTML language tags (used here to include the slightly more complex XML
syntax as well) are of the form <lang="xx">, whereas Plane 14 tags are
of the form ?xx where ? represents U+E0001 and xx, the language
identifier, is translated to Plane 14.  (HTML allows the alternative
form <lang=xx> without quotation marks, but XML does not.)  In either
case, there is clearly more parsing to be done in the case of HTML:

* the spelling of the tag "lang" must be checked;
* alternatively, it might be another type of tag altogether (not a
language tag);
* the equal sign = must be checked;
* there must be exactly 0 (HTML optional) or 2 quotation marks
surrounding the identifier;
* the greater-than sign > must be checked.

Plane 14 tags begin with a single, dedicated code point that means
language tag, so no syntax checking is needed at that point.  The
language identifier itself is encoded by dedicated code points, so
checking for the end of the tag is simpler (last character in the tag
range, or end of stream).

Parsing the cancel tag is likewise simpler:  </lang> vs. U+E0001
U+E007F.  For that matter, a Plane 14 cancel tag is not always
necessary, which is not true in HTML.

Any syntax checking of the identifier itself (e.g. "en" is valid but
"em" is not) must be performed regardless of the mechanism, so neither
approach holds an advantage there.
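
As a rough illustration of how little there is to parse, here is a sketch of reading one Plane 14 language tag. It assumes U+E0001 introduces the tag and that tag characters U+E0020..U+E007E mirror ASCII at an offset of 0xE0000; the function name and return convention are made up for the example, and validating the identifier is left out, as noted above:

```python
LANGUAGE_TAG = 0xE0001                  # dedicated "language tag" code point
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E  # tag characters mirroring ASCII

def read_language_tag(text: str, start: int):
    """Return (identifier, next_index) for a tag beginning at `start`,
    or (None, start) if no language tag begins there."""
    if start >= len(text) or ord(text[start]) != LANGUAGE_TAG:
        return None, start
    i = start + 1
    letters = []
    # The tag ends at the first character outside the tag range,
    # or at the end of the string.
    while i < len(text) and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
        letters.append(chr(ord(text[i]) - 0xE0000))
        i += 1
    return "".join(letters), i

sample = "\U000E0001\U000E0065\U000E006EHello"
print(read_language_tag(sample, 0))  # ('en', 3)
```

There is no tag name to spell-check, no equal sign, no quotation marks and no closing bracket to look for.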

Peter continued:

 2.  What extra processing is necessary to ignore Plane 14 tags that
 wouldn't be necessary to ignore any other Unicode character(s)?

 None. And if some form of light-weight markup were used, then there
 would inevitably be a need for some kind of character escape mechanism,
 so ignoring language tagging would still entail interpreting of the
 escapes. E.g.

 #LT=en#This is English text, #LT=fr# et ce texte ci est en français.
 #LT=en#To use the pound character in text, as in He's in room ##4,
 you have to encode it twice.

Exactly.  With the dedicated code points in Plane 14, you don't need
either the closing tag or the double-# escaping scheme.

I am not arguing that it takes Herculean effort to program a parser for
ASCII-based language tags, only that Plane 14 tags are even simpler, and
that some text applications call for the mechanism of greater
simplicity.

-Doug Ewell
 Fullerton, California





RE: The result of the plane 14 tag characters review.

2002-11-13 Thread Marco Cimarosti
I wrote:
[...]
 A lighter-weight method is not having language tagging at all 
 in plain text. This is appropriate in two cases:
 
 3.a) When you don't language tagging.
[...] ^

Sorry: I meant: When you don't need

_ Marco






Re: The result of the plane 14 tag characters review.

2002-11-13 Thread Doug Ewell
Marco Cimarosti marco dot cimarosti at essetre dot it wrote:

 2.  What extra processing is necessary to ignore Plane 14 tags that
 wouldn't be necessary to ignore any other Unicode character(s)?

 No extra processing would be necessary to ignore Plane 14 tags that
 wouldn't be necessary to ignore any other Unicode characters. But I
 fail to see the point of this question.

The point is to refute the argument that Plane 14 tags cause extra work
for the vast majority of applications that choose to ignore them.  If
they are Unicode-conformant, they already have to ignore characters that
they don't understand.  Ignoring Plane 14 tags is as easy as ignoring
Cherokee.

 3.  Is there any method of tagging, anywhere, that is lighter-weight
 than Plane 14?  (Corollary: Is lightweight important?)

 A lighter-weight method is not having language tagging at all in plain
 text. This is appropriate in two cases:

 3.a) When you don't [need] language tagging.

Then don't use it.  I have never suggested that all Unicode text must be
language-tagged.

 4.b) When language tagging can be provided by a higher level protocol.

Then use the tagging mechanism provided by the higher-level protocol
instead, IF you were going to use the higher-level protocol anyway.
There are lots of cases where HTML functionality duplicates, and
overrides, plain-text functionality; see UTR #20 for numerous examples.
As I mentioned in my paper, even the venerable CR/LF is overridden by
HTML <p> and <br> -- and this is fine.  There is no need to deprecate CR
and LF because of this, or to prohibit them in HTML files.  The same
should be true of Plane 14.

-Doug Ewell
 Fullerton, California





Re: The result of the plane 14 tag characters review.

2002-11-13 Thread David Starner
On Wed, Nov 13, 2002 at 08:25:21AM -0600, [EMAIL PROTECTED] wrote:
 Is this a corollary? It may be the crux of the issue. Tags using plane 14
 characters may be the lightest mechanism around, but does anybody actually
 need to avoid markup that badly?

GNU Libc used them to round-trip ISO-2022-JP-3 last time I checked.

-- 
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie. 
  -- Stephen Crane, War is Kind




RE: The result of the plane 14 tag characters review.

2002-11-13 Thread Peter_Constable

On 11/13/2002 09:03:26 AM Dominikus Scherkl wrote:

Ok, this is less heavy, but not very much.
Or what do you think what weight in this context means?!?

There is weight in terms of bandwidth, but also in terms of the mechanisms
needed to interpret markup. It takes a lot more to handle HTML or XML than
it does simply to handle language tagging using plane 14 characters -- or
some simpler form of markup than HTML / XML.


I'm not trying to argue specifically in favour of the plane 14 characters,
BTW.


- Peter









Double Byte Character Set (DBCS)

2002-11-13 Thread Magda Danish (Unicode)
Paul,

I am forwarding your inquiry to the Unicode list. I hope someone on the
list will be able to address your question.

Regards,

Magda Danish
Administrative Director
The Unicode Consortium
650-693-3921


 -Original Message-
 Date/Time:Wed Nov 13 09:32:39 EST 2002
 Contact:  [EMAIL PROTECTED]
 Report Type:  Other Question, Problem, or Feedback
 
 Dear Sir/Madam,
 
 I am writing on behalf of a global software company called 
 Rebus iS. Rebus iS has developed an underwriting package that 
 has been very successful in European markets.
 
 The application runs on an AS400 device but was developed 
 primarily for English speaking countries. 
 
 We are now looking to expand the market for this product into 
 countries such as China. To achieve this I have been informed 
 we need to enable our application for Double Byte Character 
 Set (DBCS).
 
 This confuses me when I read of UNICODE. If the AS400 
 supports UNICODE, and assuming DBCS & UNICODE are mutually 
 exclusive, would it not make more sense to enable for UNICODE 
 only from the outset?
 
 I would be grateful for your assistance in this matter.
 
 Kind Regards
 
 Paul Downey
 Rebus Insurance Systems Limited
 Registered No. 508212 England
 Registered Office
 Suffolk House,
 102 - 108 Baxter Avenue,
 Southend-on-Sea,
 Essex SS2 6JP
 United Kingdom
 
 Tel:+44  (0) 1702 236691
 Fax:+44  (0) 1702 353276
 
http://www.rebusis.com



-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
(End of Report)





RE: The result of the plane 14 tag characters review.

2002-11-13 Thread Murray Sargent
I think Doug asked for lightweight. HTML and XML markup aren't
lightweight by any means, although a special purpose plain-text oriented
XML (LTML for language-tagged markup language) might not be that much
more involved than plane 14 tags. It would also have the advantage that
standard XSLT tools could be used to translate between LTML and XHTML,
etc.

Murray

Michael Everson wrote:

At 21:50 -0800 2002-11-12, Doug Ewell wrote:

3.  Is there any method of tagging, anywhere, that is lighter-weight 
than Plane 14?  (Corollary: Is lightweight important?)

HTML and XML markup?





IBM AIX 5 and GB18030

2002-11-13 Thread xjliu_ca
Dear I18N experts,

I have searched all over IBM's web site for information about GB18030
support in AIX 4.3 and 5, but didn't find anything. I can only see that
they support GB2312 and GBK.

I know IBM was one of the pioneers in supporting GB18030, i.e. in their
ICU. So it doesn't make sense that their AIX doesn't support it.

Please shed some light!

Thanks !

Jane





Re: Double Byte Character Set (DBCS)

2002-11-13 Thread Markus Scherer
-Original Message-
We are now looking to expand the market for this product into 
countries such as China. To achieve this I have been informed 
we need to enable our application for Double Byte Character 
Set (DBCS).

DBCS is an old, pre-Unicode term for character sets with Chinese/Japanese/Korean characters.


This confuses me when I read of UNICODE. If the AS400 
supports UNICODE, and assuming DBCS & UNICODE are mutually 
exclusive, would it not make more sense to enable for UNICODE 
only from the outset?

Yes, exactly. In that sense, Unicode support is a superset of DBCS enablement. A Unicode-based 
application may have to convert between various charsets and Unicode on the edges, but all the 
support you need is provided internally by processing Unicode.
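
A minimal sketch of "converting on the edges", using Python's built-in gbk codec purely as a stand-in for whatever DBCS charset the platform actually uses; `process` and `handle` are hypothetical names, not part of any AS400 API:

```python
def process(text: str) -> str:
    """Placeholder for application logic; it only ever sees Unicode."""
    return text.upper()

def handle(raw: bytes, charset: str = "gbk") -> bytes:
    text = raw.decode(charset)        # edge: legacy bytes -> Unicode
    result = process(text)            # core: Unicode only
    return result.encode(charset)     # edge: Unicode -> legacy bytes

incoming = "保单 abc".encode("gbk")
print(handle(incoming).decode("gbk"))  # 保单 ABC
```

Supporting another legacy charset then means passing a different `charset` at the boundary; the core logic does not change.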

Best regards,
markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.




Re: IBM AIX 5 and GB18030

2002-11-13 Thread Markus Scherer
xjliu_ca wrote:

I have searched all over IBM's web site for information about GB18030
support in AIX 4.3 and 5, but didn't find anything. I can only see that
they support GB2312 and GBK.

Google found something for me:
http://www-3.ibm.com/software/ts/mqseries/support/readme/aix530_read.html
Search for 18030 on this page. Quote:

   On AIX V5.1, APAR IY26937 provides support for conversion between GB18030
   (CCSID 5488) and Unicode. Support is NOT provided for the conversion
   between GB18030 and 1388 (EBCDIC). Conversion between these CCSIDs can
   cause unpredictable results.



I know IBM was one of the pioneers in supporting GB18030, i.e. in their ICU.

:-)

markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.





Re: The result of the plane 14 tag characters review.

2002-11-13 Thread George W Gerrity
I have been watching this thread for some time now, and Doug Ewell's 
comments have prompted me to add my two cents' worth.

In an effort to unify all characters and pictographs, the decision was 
made to unify CJK characters by suppressing most variant forms. That 
turns out to be the single greatest objection from users -- 
especially Japanese -- and somehow we need a low-level way of 
indicating the target language in the context of multilingual text.

The plane 14 tags seem to be appropriate to do this, giving a hint to 
the font engine as to a good choice of alternate glyphs, where 
available.

The problems occur first, because the code scanner can no longer be 
stateless; second, because one needs to provide an over-ride to 
higher-level layout engines; third, because it can't solve problems 
where multiple glyphs exist, whose use is highly context-dependent, 
as is the case for some Japanese texts; and fourth, because there is 
no one-one translation between the (largely) non-unified simplified 
and traditional characters in Chinese.
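
To illustrate the first problem, here is a rough sketch of a scanner that splits text into language runs. It assumes U+E0001 introduces a tag, U+E0001 U+E007F cancels it, and tag characters U+E0020..U+E007E mirror ASCII; the function name is made up, and stray tag characters outside a tag are simply passed through:

```python
LANGUAGE_TAG, CANCEL_TAG = 0xE0001, 0xE007F
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E

def language_runs(text: str):
    """Yield (language, run) pairs; language is None where untagged."""
    current = None      # state carried across characters
    run = []
    i = 0
    while i < len(text):
        if ord(text[i]) == LANGUAGE_TAG:
            if run:                     # close the run under the old language
                yield current, "".join(run)
                run = []
            i += 1
            if i < len(text) and ord(text[i]) == CANCEL_TAG:
                current = None          # cancel tag: back to untagged text
                i += 1
                continue
            ident = []
            while i < len(text) and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
                ident.append(chr(ord(text[i]) - 0xE0000))
                i += 1
            current = "".join(ident)
        else:
            run.append(text[i])
            i += 1
    if run:
        yield current, "".join(run)

ja = "\U000E0001\U000E006A\U000E0061"   # language tag "ja"
cancel = "\U000E0001\U000E007F"
print(list(language_runs(ja + "直" + cancel + "x")))
# [('ja', '直'), (None, 'x')]
```

The variable `current` is exactly the state in question: the meaning of a character now depends on a tag that may have appeared arbitrarily far back in the stream.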

It seems to me that the Unicode people should bite the bullet and accept 
that where the unification process creates problems, a solution needs to 
be provided. The use of the language tags should be able to deal with 
most objections to rendering in a given language, _provided_ 
direction is given as to how the use of plane 14 tags should behave 
(I say, as a hint for glyph choice), and how the rendering engine 
should communicate with higher-order text processing.

Note that I am _not_ advocating the use of such tags to describe font 
_styles_ although when dealing with long s, for instance, the 
boundary is fuzzy.

To suggest that such fundamental glyph choices as linguistic 
preference should be left to high-level markup in text-processing 
applications, without providing a unified way to do it, seems to 
violate the spirit of Unicode.

George

Kenneth Whistler kenw at sybase dot com wrote:


 The Unicode Technical Committee would like to announce that no
 formal decision has been taken regarding the deprecation of
 Plane 14 language tag characters. The period for public review of
 this issue will be extended until February 14, 2003.


Gee, a press conference after all.  Too bad my TV was turned off.

No, seriously, thanks for the update.  I'm glad to see the matter was
considered worthy of further study.  Hopefully other people who have an
opinion on Plane 14 will contribute to the public review.

Ken also wrote:


 Doug's contribution would be
 more convincing if it dropped away the irrelevancies about whether
 the *function* of language tagging is useful and focussed completely
 on the appropriateness of this *particular* set of characters on
 Plane 14 as opposed to any other means of conveying the same
 distinctions.


That's why I included a severability clause, to the effect that if one
of my arguments was bogus (or irrelevant) it shouldn't affect the
credibility of the others.

To answer the question why Plane 14 plain-text instead of markup, I
suppose I need to make the case that this meta-information is sometimes
appropriate in short strings and labels where rich text is overkill.
This was basically the argument put forth by the ACAP people.  I did
some homework on the MLSF proposal (a little late, I know) and saw that
their primary perceived need was for tagging short strings in protocols
which did not lend themselves to an additional rich-text layer.

After seeing the MLSF tagging scheme, I agree more than ever that its
deployment would have jeopardized the usefulness of UTF-8.  Although the
number of proposals like this to extend or enhance UTF-8 has
diminished greatly since then, it would be a shame to see them resurface
on the basis that Unicode doesn't provide us any alternative.

To me, the most difficult part of the Save Plane 14 campaign seems to
be convincing people that not every text problem lends itself to a
markup solution.  Without questioning the current and future importance
of HTML and XML, there *is* text in the world that is not wrapped in one
of these formats, and cannot be reasonably converted to them, yet still
needs to be processed in some way.

Judging from the discussion on the list last week, there also seems to
be a perception that Plane 14 tags require a great deal of overhead,
even to ignore them.  I'd like to continue that discussion (especially
since the public-review period has been extended) and ask:

1.  What extra processing is necessary to interpret Plane 14 tags that
wouldn't be necessary to interpret any other form of tags?

2.  What extra processing is necessary to ignore Plane 14 tags that
wouldn't be necessary to ignore any other Unicode character(s)?

3.  Is there any method of tagging, anywhere, that is lighter-weight
than Plane 14?  (Corollary: Is lightweight important?)

-Doug Ewell
 Fullerton, California



--
Dr George W Gerrity            Phone:  +61 2 6386 3431
GWG Associates                 Fax:    +61 2 6386 3431
P O Box 229