New I18N position at W3C/Keio University

2004-12-22 Thread Martin Duerst
Dear friends, colleagues, everybody,
W3C has opened a position in Internationalization at Keio University
in Japan, because I'm leaving the W3C Team at the end of March.
For details, please see http://www.w3.org/2004/12/i18nposition.
For other open positions at W3C, please see
http://www.w3.org/Consortium/Recruitment/.
Please feel free to forward this announcement to anybody who
may be interested. Sorry if you receive duplicates.
Regards,   Martin.
#-#-#  Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-#  mailto:[EMAIL PROTECTED]   http://www.w3.org/People/D%C3%BCrst 




new mailing list: public-ietf-collation@w3.org

2004-08-15 Thread Martin Duerst
Dear Unicoders,
Some of you may be interested in this:
After discussion with Chris Newman, author of the Internet Draft
http://www.ietf.org/internet-drafts/draft-newman-i18n-comparator-02.txt,
we have created a new mailing list, [EMAIL PROTECTED],
for discussion (and hopefully completion) of this work on identifiers
for collations. This mailing list replaces an earlier one hosted at
a different place.
If you want to contribute or are interested in this work, please
subscribe by sending mail to [EMAIL PROTECTED]
(capitalization irrelevant) with "subscribe" (without the quotes)
in the subject. The archives of this mailing list can be found at
http://lists.w3.org/Archives/Public/public-ietf-collation/,
and are publicly accessible.
I expect discussion to begin no earlier than Monday, August 23, to
allow everybody interested to subscribe.
Regards,Martin.



Character Model: Two new documents and Last Call

2004-02-25 Thread Martin Duerst
The Internationalization Working Group of the W3C is glad
to announce the publication of two new documents:
Character Model for the World Wide Web 1.0: Fundamentals
   (http://www.w3.org/TR/charmod, Last Call) and
Character Model for the World Wide Web 1.0: Normalization
   (http://www.w3.org/TR/charmod-norm, first  Working Draft).
The Last Call for the first one ends on 19 March 2004 (Friday).
The main goal of the Last Call is to make sure that we have
adequately addressed the comments we have received on the
previous Last Call. But if you spot any problems, please make
sure you tell us via the form or the mailing list indicated in
the document. Also, please feel free to forward this
announcement to other interested parties.
Regards,   Martin.

P.S.: Apologies if you receive multiple copies of this email.

#-#-#  Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-#  mailto:[EMAIL PROTECTED]   http://www.w3.org/People/D%C3%BCrst


Re: Fwd: Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Martin Duerst
Hello Peter,

At 13:25 03/12/07 +0100, Peter Jacobi wrote:
Dear Doug, All,

> BTW, your "Unicode test page" is marked:
>   <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
This is of course redundant as this is the HTTP default.
Well, the HTTP spec unfortunately still says so, but the
HTML spec (and we are dealing with HTML here) disagrees,
and so does practice (if you look farther than just
Western Europe).

The heading 'Unicode' means the logical content, not the
encoding. The Tamil content is given as hex NCRs.
That's perfectly okay, of course.


> while your TSCII test page is marked "x-user-defined".

As the legacy Tamil charsets are not IANA registered, Tamil
users typically have a TSCII font set up for the display
of "x-user-defined"pages.
Why don't you do that, or get your Tamil contacts to do it?
It needs a bit of insistence (repeated checking/reminders
to the mailing list) and some patience, but otherwise is
quite easy, and would help a lot. And you have the experience
to describe how this relates to Unicode.
Regards,   Martin.



Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Martin Duerst
At 23:16 03/12/07 +0900, Jungshik Shin wrote:

On Sun, 7 Dec 2003, Peter Jacobi wrote:

> So, I'm still wondering whether Unicode and HTML4 will consider
>   லா
> valid and it is the task of the user agent to make the best out of it.
  I think this is valid.
I agree. It is the task of the user agent to make the best out of it,
and different user agents may currently do different things with it.
Because this is related to rendering and styling, it seems to make
sense that this is clarified in the CSS spec (either 2.1 or 3.0).

A more interesting case has to do with
W3 CHARMOD in which NFC is required/recommended (it's not yet complete
and W3C I18N-WG has been discussing it).  Consider the following case.
  ல&#x0BC7;
  &#x0BBE;
Because <U+0BC7, U+0BBE> is equivalent to U+0BCB, we couldn't use
the above if NFC is required, even though in the legacy TSCII encoding
it's possible.
Yes, this is a bad idea. But there is Web technology that can do
this (see below).
The basic problem is that one has to draw the line somewhere.
Sometimes, one would for example like to color the dot on an 'i'.
In Unicode, it may theoretically be possible (with a dotless 'i'
and a 'dot above' or some such), but it wouldn't be a real 'i'
anymore.
And there is of course a slippery slope. For example, consider
the crossbar on a 't'. You can't color that, in any encoding.
But a font designer may want to do that, for some instructional
material, or may want to color all serifs in a font,...
Similar examples exist in almost any other script. For most
intents and purposes, most people are okay with what they
can and can't do, but occasionally, we come close to the
dividing line, and some of us are quite surprised. But somehow,
we have to agree on what's a character and what's only a glyph,
and we have to agree which combinations are canonically equivalent.

The same is true of Korean syllables (see below), as
Philippe pointed out.
  각
Yes. Korean is particularly difficult because it is the most
logical, well-designed script in the world. It has more
clearly identifiable hierarchical levels than any other
script. It is very difficult to agree on which level
characters should be.
As an example, the vowel pairs a/ya, o/yo, u/yu, and so on
are distinguished by changing from one small stroke to two
small strokes. A Web page for children or foreigners may
want to color these strokes separately. With the current
encoding(s) in Unicode this is not possible, but I'm sure
somebody has designed an encoding where this would be possible.
So while this does not solve Peter's immediate problem,
starting to change Unicode to color characters, glyphs,
or character parts would be an extremely slippery slope.
Working on better font technology seems to be much better
suited to do the job. And such technology actually is
already around. It's part of SVG. Chris Lilley had a
very nice example once, but it got lost in a HD crash.
Chris, any chance of getting a new example?
SVG (http://www.w3.org/Graphics/SVG/ http://www.w3.org/TR/SVG11/)
is the XML-based vector graphics format for the Web.
Here is more or less how it works (as far as I understand it):
In SVG Fonts (http://www.w3.org/TR/SVG11/fonts.html),
SVG itself is used to describe glyph shapes. This means
that all kinds of graphic features, including of course
coloring, but also animation,... are available.
But of course you don't want colors to be fixed.
So glyphs in a font, or parts of glyphs, also allow
the 'class' attribute. So you can mark glyphs or glyph
components with things such as class='accent' or
class='crossbar', and so on. The rendering of pieces
in this class can then be controlled from a CSS
stylesheet.  (I hope I got the details right.)
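To make this concrete, here is a rough, untested sketch along those lines
(element and attribute names are from SVG 1.1 Fonts, but the font name and
the path data are just made-up placeholders):

   <svg xmlns="http://www.w3.org/2000/svg" width="300" height="200">
     <style type="text/css">
       .stem     { fill: black; }
       .crossbar { fill: red; }    /* color only the crossbars */
     </style>
     <defs>
       <font horiz-adv-x="600">
         <font-face font-family="DemoFont" units-per-em="1000"
                    ascent="800" descent="-200"/>
         <missing-glyph horiz-adv-x="600"/>
         <glyph unicode="t" horiz-adv-x="600">
           <!-- one glyph built from two pieces, each with its own class -->
           <path class="stem"     d="M250 0 L350 0 L350 700 L250 700 Z"/>
           <path class="crossbar" d="M150 450 L450 450 L450 530 L150 530 Z"/>
         </glyph>
       </font>
     </defs>
     <text x="40" y="150" font-family="DemoFont" font-size="100">t</text>
   </svg>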
Regards,Martin.



Re: Transcoding Tamil in the presence of markup

2003-12-07 Thread Martin Duerst
At 23:34 03/12/07 +0900, Jungshik Shin wrote:

On Sun, 7 Dec 2003, Peter Jacobi wrote:

> There is some mixup of lang and encoding tagging, which I didn't fully
> understand.
   When lang is not explicitly specified, Mozilla resorts to 'inferring'
'langGroup' ('script (group)' would have been a better term) from
the page encoding. Because UTF-8 is script-neutral, it's important to
specify 'lang' explicitly. Your page is in ISO-8859-1 so that without
lang specified, it's assumed to be in the 'x-western' langGroup (well, Latin
script). Anyway, this behavior slightly changed recently in Windows
version (I forgot when I committed that patch, before or after 1.4)
and each Unicode block is assigned the default 'script'. The way fonts
are picked up by the Xft version of Mozilla makes it harder to do the
equivalent on Linux.
I know that font selection/composition is a terribly difficult
business, and hard work, so improving things takes time.
Starting out with certain assumptions about fonts for certain
encodings is clearly very helpful for speed. But I think that
not (correctly) rendering a character that is obviously in
one script and not in another is a bad idea.
Years ago, I developed a very flexible system that was able to
start out with the user-selected font but would use another
font if the first font wasn't able to do the job. The basic
architecture was in many ways very simple, but it took quite
some time to get it right. Once I had this basic architecture,
all kinds of neat things became very easy. For details, see
the paper from the 7th Unicode Conference at:
http://www.ifi.unizh.ch/groups/mml/people/mduerst/papers/PS/FontComposition.ps.gz

Regards,Martin.



AddDefaultCharset considered harmful (was: Mojibake on my Web pages)

2003-09-25 Thread Martin Duerst
Hello Doug, others,

Here is my most probable explanation:
Adelphia recently upgraded to Apache 2.0. The core config file (httpd.conf)
as distributed contains an entry
AddDefaultCharset iso-8859-1
which does what you have described. They probably adopted this
because the comment in the config file suggests that it's important.
I have just filed a bug with bugzilla, asking that this default
setting be removed or commented out, and the comment fixed, at
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23421. You may
want to vote for that bug.
I have also commented on a related bug that I found, at
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14513.
I suggest you tell your Internet provider:
1) that they change to AddDefaultCharset Off
   (or simply comment this out)
2) that they make sure you get FileInfo permission in your directories,
   so that you can do the settings you know are correct.
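In configuration terms, a rough sketch of what this might look like
(untested; the directory path is just an example):

   # httpd.conf: don't claim a charset you don't actually know
   AddDefaultCharset Off
   <Directory "/home/*/public_html">
       AllowOverride FileInfo
   </Directory>

   # .htaccess in the user's own directory, e.g. for UTF-8 pages:
   AddType "text/html; charset=utf-8" .html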
The comment in the config file contains mostly very strange statements:


#
# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
#
AddDefaultCharset ISO-8859-1
>>>
If anybody knows something about these security issues, please
tell me (any mention of security issues usually has webmasters
in control, for good reasons).
Regards,   Martin.



At 22:40 03/09/22 -0700, Doug Ewell wrote:
Apologies in advance to anyone who visits my Web site and sees garbage
characters, a.k.a. "mojibake."  It isn't my fault.
Adelphia is currently having a character-set problem with their HTTP
servers.  Apparently they are serving all pages as ISO 8859-1 even if
they are marked as being encoded in another character set, such as
UTF-8.

If you manually change the encoding in your browser to UTF-8, or
download the page and display it as a local file, everything looks fine
because Adelphia's server is no longer calling the shot.  Their tech
support people acknowledge that the problem is at their end and said
they would look into it.
I understand that having the "Unicode Encoded" logo on my page next to
these garbage characters may not reflect well on Unicode, especially to
newbies.  I'm considering putting a disclaimer at the top of my pages,
but I'm waiting to see how quickly they solve the problem.
-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Language Tag Registrations

2003-06-02 Thread Martin Duerst
Hello Marion,

IANA won't ask your question. They are just the record keeper,
they don't make any decisions.
If you have a need for identifying a particular kind of language,
then what you do is that you submit a registration proposal.
Others will then comment on that proposal. If you don't have
an actual need for tagging, then you your question is irrelevant
for this list. If you don't make a registration proposal, then
you question is again irrelevant for this list.
Regards,  Martin.

At 18:52 03/05/31 +0100, Marion Gunn wrote:
Dear "IANA" <[EMAIL PROTECTED]>, we wish to ask whether the following is a
legitimate question for your registry, which people here believe it is:
>What, then, is the code for the English of 'Northern Ireland'?
>(GB+NI=UK.)
Since Ulster, as "IANA" <[EMAIL PROTECTED]> knows, is divided by an
international border, is the logical reply 'encode Ulster English
separately for each side of the border'? Is Basque separately 'lang-tagged'
for ES and FR?
We ask, because we do not know, and if you do not know either, that is
okay, and we wish you well in bringing all queries to harmonious
conclusions, if possible.
mg
--
Marion Gunn * EGT (Estab.1991) * http://www.egt.ie *
fiosruithe/enquiries: [EMAIL PROTECTED] * [EMAIL PROTECTED] *
___
Ietf-languages mailing list
[EMAIL PROTECTED]
http://www.alvestrand.no/mailman/listinfo/ietf-languages




[IRI] new mailing list: public-iri@w3.org

2003-04-03 Thread Martin Duerst
[Apologies if this announcement reaches you multiple times.]

Internationalized Resource Identifiers (IRIs): new mailing list

To complete the discussion on Internationalized Resource Identifiers
(IRIs) [1] and move it to Proposed Standard, we have created a new,
dedicated mailing list. This is based on discussion at the recent
BOF on URIs [2] at the IETF in San Francisco.
If you are interested in IRIs, please subscribe to the list
by sending a mail to [EMAIL PROTECTED] with "subscribe"
(without the quotes) as the subject, or just click below on [3].
The mailing list is publicly archived at [4]. Please wait for
discussion until next week to give everybody a chance to sign up.
Please note that the first time you send a mail, you may be asked
to confirm your mail via a web page. This serves as spam protection
and to make sure that you understand that your mail is publicly
archived.
I have also created an issues list at [5] where I will track
the discussion.
Regards,Martin.

[1] http://www.ietf.org/internet-drafts/draft-duerst-iri-03.txt
[2] http://www.ietf.org/ietf/03mar/uribof.txt
[3] mailto:[EMAIL PROTECTED]
[4] http://lists.w3.org/Archives/Public/public-iri/
[5] http://www.w3.org/International/iri-edit/#Issues



Re: [REPOST, LONG] XML and tags (LONG) - SCSU for XML

2003-02-21 Thread Martin Duerst
At 11:24 03/02/21 -0800, Markus Scherer wrote:
Marco Cimarosti wrote:
BTW, would it be possible to encode XML in SCSU?
Yes. Any reasonable SCSU encoder will stay in the ASCII-compatible 
single-byte mode until it sees a character from beyond Latin-1. Thus the 
encoding declaration will be ASCII-readable.
I think there are various different issues here:

- Would it be possible to *en*code an XML document in SCSU?
  The answer is clearly yes.
- Would it be *possible* to have such documents *de*coded by an XML
  processor according to the rules in Appendix F of the XML Recommendation
  (i.e. no external encoding information, such as in a standalone file).
  The answer to this question is what Markus said above.
- Is it *probable* that an XML processor decodes XML in SCSU?
  No, XML processors are only required to support UTF-8 and UTF-16.
  Many of them support other encodings, such as iso-8859-1,..., but
  support for SCSU is thin as far as I'm aware.
Regards,Martin.




Re: BOM's at Beginning of Web Pages?

2003-02-21 Thread Martin Duerst
At 13:14 03/02/18 -0800, Jonathan Coxhead wrote:

   That's a very long-winded way of writing it!

   How about this:

  #!/usr/bin/perl -pi~ -0777
  # program to remove a leading UTF-8 BOM from a file
  # works both STDIN -> STDOUT and on the spot (with filename as argument)
  s/^\xEF\xBB\xBF//s;

which uses perl's -p, -i and -0 options to the same effect.

Yes, indeed, that's perfect. I was thinking about something
in that direction, but never got to check it out, sorry.

Regards,   Martin.




Re: BOM's at Beginning of Web Pages?

2003-02-17 Thread Martin Duerst
Some comments:

- If you can avoid it, don't use a BOM at the start of an UTF-8
  HTML file. It will display nicely on more browsers.

- The W3C Validator http://validator.w3.org/ accepts the BOM for
  HTML 4.01, and also XHTML. It probably should produce a warning.
  It did when I originally added code to handle it. I have requested
  that it be added again.

- Adding a BOM/ZWNBSP to the whitespace definition is a bad idea,
  because it would allow a ZWNBSP in all kinds of places where
  not seeing a space would be confusing (e.g. between attributes).
  Also, HTML 4 is only being maintained, not being developed.

- That HTML 4.0 allows ZWSP (​) as whitespace in
  http://www.w3.org/TR/REC-html40/struct/text.html#h-9.1 is for
  line breaking/rendering reasons (Thai), within element content.
  This is in conflict with the whitespace definition for syntactic
  purposes, which is formally given at
  http://www.w3.org/TR/REC-html40/sgml/sgmldecl.html and does
  not include ZWSP (​). I have filed a request for
  clarification.

- RFC 2279 does not approve or disapprove of the BOM. Both Unicode
  and ISO 10646 allow the BOM as a signature for UTF-8. RFC 2279
  is being updated. See
  http://lists.w3.org/Archives/Public/ietf-charsets/2003JanMar/0209.html.

- For XML, a BOM at the start of UTF-8 is allowed by an erratum at
  http://www.w3.org/XML/xml-V10-2e-errata#E22. But similar to HTML,
  better to not start your XML files with a BOM, because there are
  XML parsers out there that don't like it (and this was okay at
  least until 2001-07-25).

- The BOM is both rather handy in a Windows/Notepad scenario and
  seriously disruptive in an Unix-like filter scenario (which may
  also be on Windows). I have found that Notepad doesn't need the
  BOM to detect that a file is UTF-8 if it has enough other information
  (this is on a Japanese Win2000, your mileage may vary). It would be
  nice if it had a setting to not produce a BOM.

- I append a small perl program that removes an UTF-8 BOM if there
  is one. Quite handy, I use it regularly. Feel free to use and change
  on your own responsibility.
  (i.e. if it starts to eat up your files, don't blame me!)

Regards,   Martin.




#!/usr/bin/perl

# program to remove a leading UTF-8 BOM from a file
# works both STDIN -> STDOUT and on the spot (with filename as argument)

if ($#ARGV > 0) {
    print STDERR "Too many arguments!\n";
    exit;
}

my @file;       # file content
my $lineno = 0;

my $filename = $ARGV[0];
if ($filename) {
    # in-place mode: read the whole file, then write it back without the BOM
    open BOMFILE, "$filename";
    while (<BOMFILE>) {
        if (!$lineno++) {
            s/^\xEF\xBB\xBF//;      # strip the BOM from the first line only
        }
        push @file, $_;
    }
    close BOMFILE;
    open NOBOMFILE, ">$filename";
    foreach $line (@file) {
        print NOBOMFILE $line;
    }
    close NOBOMFILE;
}
else {  # STDIN -> STDOUT
    while (<>) {
        if (!$lineno++) {
            s/^\xEF\xBB\xBF//;      # strip the BOM from the first line only
        }
        push @file, $_;
    }
    foreach $line (@file) {
        print $line;
    }
}




RE: glyph selection for Unicode in browsers

2002-10-08 Thread Martin Duerst

At 13:41 02/10/02 +0900, Martin Duerst wrote:

>I'm not sure this is possible with Apache, maybe there is a need
>for a RemoveCharset directive similar to RemoveType
>(http://httpd.apache.org/docs/mod/mod_mime.html#removetype).
>Or maybe there is some other way to get the same result.
>If a new directive is desirable, then let's try to hack
>the Apache code or to propose it to the Apache people.
>Similar of course for other server implementations.

Over lunch, a colleague told me that RemoveCharset has
been added to Apache 2.0. See e.g.
http://httpd.apache.org/docs-2.0/mod/mod_mime.html#removecharset.

So the right thing to do may be to ask your ISP to
upgrade to Apache 2.0.
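Once they do, something like the following in your own .htaccess should
work (a sketch, assuming the ISP grants FileInfo overrides):

   # .htaccess
   RemoveCharset .html .htm     # drop inherited charset associations
   AddDefaultCharset Off        # and don't fall back to a server-wide default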

Regards,Martin.





Call for participation: I18N Activity WG Task Forces

2002-10-03 Thread Martin Duerst

Dear Unicoders,

As announced at the International Unicode Conference in San Jose
the W3C Internationalization Activity has recently been restructured,
and the Internationalization Working Group (WG) and Interest Group (IG)
have been re-chartered. We are sure that this will provide you with
increased possibilities to contribute to Web Internationalization
in the widest sense, and are looking forward to your participation.

More information can be found at
 http://www.w3.org/International/about
and in the WG charter
 http://www.w3.org/2002/05/i18n-recharter/WG-charter.

The Working Group now consists of three Task Forces (see below)

- The Core Task Force is continuing previous work: completing the
   Character Model for the World Wide Web and the Internationalized
   Resource Identifiers (IRIs) specifications, and continuing reviewing
   specifications produced by other W3C Working Groups.
   The chair of the WG and the Core Task Force is Misha Wolf, Reuters.
   See http://www.w3.org/International/core for how you can participate.

- The Web Services (WS) Task Force is investigating the needs and problems
   in the area of internationalization of Web Services, in particular the
   dependency of Web Services on language, culture, region, and locale-related
   contexts.
   The chair of the Web Services Task Force is Addison Phillips, webMethods.
   See http://www.w3.org/International/ws for how you can participate.

- The GEO (Guidelines, Education & Outreach) Task Force is helping to get
   the internationalization aspects of W3C technology better understood and
   more widely and consistently used.
   The chair of the GEO Task Force is Richard Ishida, W3C. See
   http://www.w3.org/International/geo/howto-join-geo for how you
   can participate.

Regards,   Martin.







RE: glyph selection for Unicode in browsers

2002-10-01 Thread Martin Duerst

At 12:14 02/10/01 -0400, [EMAIL PROTECTED] wrote:

>I agree that 'sniffing' and 'guessing' are ill-defined, and not to be
>relied upon.  However, I find it a bit 'ill-defined' that there is no
>well-defined (web server independent) way for the 'users' to override
>the possibly wrong encoding default of the web server.  Either way
>(a) the user has to do something web server dependent
>(b) the admin has to do changes to the site config
>seems a bit clunky and fragile.
>
>Since the current "resolving order" is obviously already deployed out
>there and relied upon by someone, it cannot be changed, but possibly
>something new could be introduced?

Well, servers can always be improved by the various server implementers.
What standards specify is what goes 'over the wire'.

The only thing you actually have to do is to make sure that the server
doesn't add a 'charset' parameter to the Content-Type header for
the directories you are using. Then the <meta> is the only info,
and is used by the browser.
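For concreteness, the intended over-the-wire result looks roughly like
this (hypothetical page; note the absence of a charset parameter in the
header):

   HTTP/1.1 200 OK
   Content-Type: text/html

   <html>
   <head>
   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
   ...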

I'm not sure this is possible with Apache, maybe there is a need
for a RemoveCharset directive similar to RemoveType
(http://httpd.apache.org/docs/mod/mod_mime.html#removetype).
Or maybe there is some other way to get the same result.
If a new directive is desirable, then let's try to hack
the Apache code or to propose it to the Apache people.
Similar of course for other server implementations.

Regards,Martin.






RE: glyph selection for Unicode in browsers

2002-09-30 Thread Martin Duerst

At 07:37 02/09/26 +0900, [EMAIL PROTECTED] wrote:
>I would be happy if just this
>
><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
>
>would be enough to convince the browsers that the page is in UTF-8...
>It isn't if the HTTP server claims that the pages it serves are in
>ISO 8859-1.  A sample of this is http://www.iki.fi/jhi/jp_utf8.html,
>it does have the meta charset, but since the webserver (www.hut.fi,
>really, a server outside of my control) thinks it's serving Latin 1,
>I cannot help the wrong result.  (I guess some browsers might do better
>work at sniffing the content of the page, but at least IE6 and Opera 6.05
>on Win32 seem to believe the server rather than the (HTML of the) page.

Sniffing isn't a good idea in the long term. It may work
for simple web page serving, but as soon as you go XML and
start to move data around without the user having a chance
to see it frequently, you'll end up with a big mess.

Also, 'guessing' is very ill-defined. You might serve
a document to your favorite browser, and it looks okay.
But other browsers might guess a bit differently, or
a new version of your favorite browser may guess a bit
differently, and off you are.

Regards,   Martin.




Re: browsers and unicode surrogates

2002-04-23 Thread Martin Duerst

Just a very small correction:

At 07:19 02/04/22 -0400, James H. Cloos Jr. wrote:

>There are other ways as well.  Apache will already (if you use the
>default configs) add the Content-Language header if you use a filename
>like foo.en.html.  You could have it also add the charset via a
>similar mechanism.  Something like:
>
>AddCharset UTF-8 utf8
>
>will make foobar.en.utf-8.html send the headers:

This should of course be foobar.en.utf8.html

(or you can extend it to the extension utf-8 by saying

AddCharset UTF-8 utf8 utf-8

so that the original name foobar.en.utf-8.html works as well.)

>Content-Language: en
>Content-Type: text/html; charset=UTF-8

Regards,   Martin.




Re: browsers and unicode surrogates

2002-04-23 Thread Martin Duerst

At 22:25 02/04/19 +0100, Steffen Kamp wrote:
>However, when giving the validator a ASCII-only document with a META tag
>specifying UTF-16 as encoding (just for testing) it says that it does not
>yet support this encoding, so I don't fully trust the validator in this case.

The validator indeed doesn't yet support UTF-16. It's on the
to-do list.

Regards,   Martin.




New I-D for Internationalized Resource Identifiers

2002-04-17 Thread Martin Duerst

Dear Unicoders,

I have just submitted draft-w3c-i18n-iri-00.txt to the Internet Drafts
editor. This draft replaces draft-masinter-url-i18n-08.txt. It should be
published in a few hours/days. In the mean time it is available at
http://www.w3.org/International/2002/draft-w3c-i18n-iri-00.txt.

Based on discussions at the W3C Technical Plenary in February, and in
particular on input from Larry Masinter, we have made some changes in
the responsibilities for the Internationalized Resource Identifiers
(IRI) draft, as follows:

- The W3C I18N WG is taking on responsibility for carefully
   reviewing the current draft and bringing it to maturity for
   submission to the IESG.

- Larry is glad to step down as a co-editor, and Michel Suignard
   has volunteered to become a new co-editor. Many thanks to Larry
   for his work as co-author of many earlier versions of this document.

This has resulted in the name change. The document will still be handled
as an individual submission from the point of view of the IETF. We hope
to take this document to IETF/W3C Last Call in May, after some more work.

Please review draft-w3c-i18n-iri-00.txt and send comments to
[EMAIL PROTECTED] (publicly archived at
http://lists.w3.org/Archives/Public/www-i18n-comments/).


Regards,Martin.





Re: xml 1.0 and unicode ideograph ext a and ext b

2002-04-06 Thread Martin Duerst

Hello Yung-Fong,

First, please send potential error reports to [EMAIL PROTECTED]
as indicated in the spec. Second, as somebody else has already
said, the XML Core WG is working on extending the repertoire of
XML Names in XML Blueberry / XML 1.1.

If you have any specific comments, I suggest you send them to
the W3C I18N IG for discussion.

Regards,   Martin.

At 14:20 02/04/03 -0800, Yung-Fong Tang wrote:
>dear XML editors:
>
>When our QA engineers try to verify our XML support with characters from 
>Unicode Ideograph Ext A and Ext B block , we found the following issue
>http://bugzilla.mozilla.org/show_bug.cgi?id=134963
>
>basically, according to the XML 1.0 2nd edition specification, only 
>BaseChar and Ideograph could be used as Letter, but none of the characters 
>from Unicode Ideograph Ext A block and Unicode Ideograph Ext B block are 
>listed as Ideograph. Therefore, xml which use characters from these 
>two block for element name are consider not well-formed now.
>
>Any plan to change the xml specification to follow the newly updated 
>Unicode 3.2 Standard ?
>
>
>





Rechartering the W3C I18N Activity

2002-03-05 Thread Martin Duerst

Dear Unicoders,

W3C organized a workshop co-located with the 20th
Unicode Conference last month in Washington DC,
to discuss the future of the W3C Internationalization
Activity.

The minutes and results of the workshop are now published at
http://www.w3.org/2002/02/01-i18n-workshop
http://www.w3.org/2002/02/01-i18n-workshop/minutes
http://www.w3.org/2002/02/01-i18n-workshop/consensus

The plan now is to use the month of March for a concentrated effort
to collect additional material. At the end of March, we will
prepare the new charter(s) for the W3C Internationalization Activity
based on the input we receive.

The workshop identified five work streams of potential
future work:

- Guidelines, best practices
- Distributed services (e.g. exchanging locale/collation info)
- Education & Outreach
- Localizability
- Existing work (reviews, character model, liaisons)

We would like to invite you to participate and contribute
in helping shape the future of the W3C Internationalization
Activity. For this, please join the [EMAIL PROTECTED]
mailing list, archived at
http://lists.w3.org/Archives/Public/www-i18n-workshop/.
[this archive is publicly accessible]

To subscribe, send a mail with subject 'subscribe' to
[EMAIL PROTECTED]
(clicking the following link should do the job:
mailto:[EMAIL PROTECTED]?Subject=subscribe).

The workshop was a very good start for planning the new activities,
but more is needed. In particular, for each work stream,
we need to get:

- A more detailed list of work items (including priorities,
   effort needed)

- Information on business cases, use cases,... to justify
   why W3C should do this work (e.g. how would this improve the Web?)

- A list of experts and key players

Please don't hesitate to send your ideas to [EMAIL PROTECTED]
Please use prefixes (e.g. Guidelines, Distributed, Education,
Localizability, Existing) in mail subjects.


Looking forward to hearing from you soon,Martin.

#-#-#  Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-#  mailto:[EMAIL PROTECTED]   http://www.w3.org/People/D%C3%BCrst





Re: All-kana documents

2002-03-05 Thread Martin Duerst
Character-based compression schemes have been suggested by others.
But this is not necessary, you can take any generic data compression
method (e.g. zip,...) and it will compress very efficiently.

The big advantage of using generic data compression is that
it's already available widely, and in some cases (e.g. modem dialup,
some operating systems, web browsers) it is already built in.
The main disadvantage is that it's not as efficient as specialized
methods for very short pieces of data.
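As a rough illustration, any generic compressor shows this; here is a
tiny Perl sketch using Compress::Zlib (the sample string is artificial
and highly repetitive, so real text will compress less dramatically):

   use Compress::Zlib;

   # "かな" (2 kana, 6 UTF-8 bytes) repeated 1000 times
   my $kana = "\xE3\x81\x8B\xE3\x81\xAA" x 1000;

   my $raw        = length($kana);             # UTF-8 size in bytes
   my $compressed = length(compress($kana));   # generic zlib compression

   printf "raw: %d bytes, compressed: %d bytes\n", $raw, $compressed;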

Regards,Martin.

At 18:47 02/03/04 -0500, $B$m!;!;!;!;(B $B$m!;!;!;(B wrote:
>If I have some all-kana documents (like, say, if I decide to encode some 
>old women's literature, not that I will, but you might), is there an 
>extension of UTF-8 that will alow me to strip off the redundant "this is 
>kana" byte from most of the kana? After the first few thousand kana, it 
>might be like, "Yeah, we get it already! It's kana! It's KANA!! You can 
>stop reminding us now!!"
>
>This goes too for Hebrew, Greek, etc.
>
>$B==0l$A$c$s!!0&2CMvGO(B


barcodes (was: RE: Passing non-english character params in URL)

2002-02-05 Thread Martin Duerst

Hello Brant,

This is not really a Web internationalization question.
Therefore I'm forwarding it to the unicode mailing list.

Regards,Martin.

At 08:48 02/02/05 -0500, IDAutomation.com, Inc. wrote:
>I am hoping you can help me with a FileMaker task. We sell barcode fonts and
>we have several utilities on the PC side that you send a string to and it
>returns another string to add checksums to the string.
>
>Do you know if it is possible to create a FileMaker plug-in that formats
>barcode fonts in UNICODE to the PC and MAC in the same way?
>
>I am hoping that one single plug-in for FileMaker will work the same on the
>MAC and the PC with UNICODE font formatting. Do you know this to be true?
>Our fonts now all support UNICODE.
>
>
>Best Regards,
>
>Brant Anderson
>IDAutomation.com, Inc.
>"Your Source for Quality Symbology"
>http://www.IDAutomation.com/
>(Member Better Business Bureau)
>
>BARCODE PRODUCTS:
>ASP Server: http://www.idautomation.com/asp/
>Fonts: http://www.bizfonts.com/
>ActiveX: http://www.idautomation.com/activex/
>Java: http://www.idautomation.com/java/
>Scanners: http://www.idautomation.com/scanners/
>ORDER NOW: http://www.idautomation.com/sitemap/





Re: New plane 1 page for testing your browsers

2002-01-17 Thread Martin Duerst

At 21:44 02/01/06 -0800, James Kass wrote:

>Martin Duerst wrote,

> > (I wrote,)
> > It would be perfectly correct and might even allow the page to
> > sport one of those "valid-HTML" gifs from W3.
>
>But it doesn't.  Just tried changing the charset on an NCR Deseret test
>page from UTF-8 to US-ASCII.  Both charsets fail on the W3 validator
>because the NCRs are out of any recognized range.

This bug has now been fixed in the test version. If you use

http://validator.w3.org:8188/,
it will validate correctly. The NCR errors are also gone from
Tex's page, but that page still has quite a few other problems.

If somebody has a page with plane-1 characters in hexadecimal
NCRs, please check it, too, and tell me whether the validator
works for these or not.

Regards,   Martin.




Reminder: Jan 10: W3C I18N Workshop deadline

2002-01-07 Thread Martin Duerst

Dear Unicoders,

The deadlines for registrations and submissions for the
W3C Internationalization workshop are approaching rapidly;
please make sure you don't miss them.

Registration deadline:
   January 10th, 2002 (Thursday)
(see http://www.w3.org/2002/02/01-i18n-workshop/cfp#registration)

Deadline for position statements:
   January 10th, 2002 (Thursday)
(see http://www.w3.org/2002/02/01-i18n-workshop/cfp#position)
Please make sure you don't forget to send in your position
statement.

The workshop will be held as follows.

Date: 1 February 2002
Location: Omni Shoreham Hotel, Washington DC, USA
(just after the 20th International Unicode Conference)

Hotel rooms are still available at the special conference rate,
but please reserve as soon as possible.
(see http://www.w3.org/2002/02/01-i18n-workshop/cfp#Venue)

The full Call for Participation, at
http://www.w3.org/2002/02/01-i18n-workshop/cfp, contains additional
information about goals and scope.

Please note: while this workshop is open, there is an attendance limit
of 45; preference will be given based on

1. quality of position statements and
2. W3C Membership.






Re: New plane 1 page for testing your browsers

2002-01-06 Thread Martin Duerst

At 00:05 02/01/04 -0500, Tex Texin wrote:
>Thanks to James Kass, we have a new version of the Unicode examples for
>plane 1, that uses UTF-8, instead of NCRs.
>
>So the following link is to the original page that is code page
>"x-user-defined" and uses NCRs for supplementary characters:
>
>http://www.geocities.com/i18nguy/unicode-example-plane1.html

Hello Tex,

I don't understand why this page uses x-user-defined as a charset.
Labeling it as US-ASCII would be perfectly correct.
The 'charset' only applies to the binary encoded characters, not
to NCRs.
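For illustration, a minimal page along these lines (hypothetical; U+10330
is GOTHIC LETTER AHSA) is perfectly fine as US-ASCII:

   <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
       "http://www.w3.org/TR/html4/strict.dtd">
   <html>
   <head>
   <meta http-equiv="Content-Type" content="text/html; charset=US-ASCII">
   <title>Plane 1 NCR test</title>
   </head>
   <body>
   <p>Gothic ahsa: &#x10330;</p>
   </body>
   </html>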

Regards,   Martin.



>It contains a link to the following page, which is code page "UTF-8" and
>uses UTF-8 for supplementary characters:
>
>http://www.geocities.com/i18nguy/unicode-plane1-utf8.html
>
>The text in the tables should be the same on both pages, but your
>browser mileage may (more likely will) vary.
>
>Neither Netscape or IE support supplementary UTF-8.
>I haven't tried it but James reports that Opera 6.0 displays the Gothic
>fine, the Etruscan is backwards (so the right to left direction is not
>working) and the Deseret is not yet supported. Still, Opera leads the
>pack. I would be glad to receive reports on other browsers and will
>update the page as time allows.
>
>tex
>
>
>Tex Texin wrote:
> >
> > Otto Stolz wrote:
> >
> > > I have tested three browsers with the example page
> > > ,
> > > which comprises NCRs of plane 1 characters.
> >
> > Thanks Otto. There certainly would be value in creating an identical
> > page in UTF-8 rather than NCRs, for further testing, and I intend to
> > provide one, I just haven't had the time. Actually, it started out as
> > utf-8 and became ncr's when we found that is what worked for IE.
> >
> > Maybe over the weekend...
> > tex
> > --
> > -
> > Tex TexinDirector, International Business
> > mailto:[EMAIL PROTECTED]Tel: +1-781-280-4271
> > the Progress Company Fax: +1-781-280-4655
> > -
> > For a compelling demonstration for Unicode:
> > http://www.geocities.com/i18nguy/unicode-example.html
>
>--
>-
>Tex TexinDirector, International Business
>mailto:[EMAIL PROTECTED]Tel: +1-781-280-4271
>the Progress Company Fax: +1-781-280-4655
>-
>For a compelling demonstration for Unicode:
>http://www.geocities.com/i18nguy/unicode-example.html





Ruby (was: Re: Vertical scripts)

2001-12-26 Thread Martin Duerst
At 17:30 01/12/25 -0800, Michael (michka) Kaplan wrote:
>From: "$BAk]namdqor(B $BDialamt_dgr"(B <[EMAIL PROTECTED]>
>
> > By the way, does any browser in common use
> > support the Ruby extensions to HTML?

The 'ruby extensions for HTML' are defined in
http://www.w3.org/TR/ruby/, a W3C recommendation.


>Well, looking at links like:
>
>http://msdn.microsoft.com/workshop/author/dhtml/reference/objects/rt.asp
>
>(all on one line) and just doing a random search on
>http://msdn.microsoft.com/ for keywords like "HTML Ruby" make me think its
>supported in IE5 and later?

As far as I know, IE5 and later support the 'simple ruby markup'
(including parentheses) defined in the Recommendation
(see http://www.w3.org/TR/ruby/#simple-ruby1), but do
not support complex ruby markup.
[I'd be very glad to be corrected.]
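For reference, 'simple ruby markup' as defined there looks roughly like
this (the rp elements hold fallback parentheses for browsers that don't
render ruby):

   <ruby>
     <rb>漢字</rb>
     <rp>(</rp><rt>かんじ</rt><rp>)</rp>
   </ruby>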

Regards,   Martin.


Re: Tategaki (was: Re: Updated Compelling Unicode Demo)

2001-12-25 Thread Martin Duerst
At 22:26 01/12/25 +0200, $BAk]namdqor(B $BDialamt_dg(B wrote:
>* [EMAIL PROTECTED] [2001-12-24 04:55]:
> >
> > Also, when will we have a stylesheet code for
> > tategaki? They should not use "ideographic" in the
>
>"Tategaki" is simply vertical writing, right? Is this another word that
>should be added to the jargon of character set encoding, like
>"mojibake"? Probably not...
>
> > There should be a "tategaki-ltr" and a "tategaki-rtl",
> > depending on which line of characters is read first.
> > Which way is Mongolian?

For what's in the works for CSS3, please see
http://www.w3.org/TR/css3-text/#PrimaryTextAdvanceDirection.

For XSL (a W3C Recommendation), please see
http://www.w3.org/TR/xsl/slice7.html#writing-mode.

Regards,   Martin.


Re: Rush request for help!

2001-12-23 Thread Martin Duerst

I agree with Jungshik that U+76F4 (straight) is possibly the
case where unification went farthest in the sense that it's
the case where average modern readers in various areas might
be most (1) confused if they see the glyph variant they are
not used to.

(1) 'most confused' should not be misunderstood to be very
confused; for most other cases cited often, e.g. the grass
radical, the bone radical/character,..., it's difficult for
the users to recognize the difference in running text.

If you want to give an example of where 'unification fails',
then you have to really be precise and speak about the
application case. In general, unification just works, that's
how it has been designed.

Regards,   Martin.

At 16:59 01/12/21 -0500, Jungshik Shin wrote:
>On Fri, 21 Dec 2001, Suzanne M. Topping wrote:
>
> > The two examples I dug out of various ongoing email debates etc. are
> > below:
> >
> >   The traditional Chinese glyph for "grass" uses four
> >   strokes for the "grass" radical, whereas the simplified Chinese,
> > Japanese,
> >   and Korean glyphs use three. But there is only one Unicode point
> > for the
> >   grass character (U+8349) regardless of writing system.
>
>   I can't say about Japanese or simp. Chinese, but in Korea
>the number of strokes for the 'grass' radical (U+8279) is sometimes three
>and other times four. Every Korean with the minimum knowledge of Chinese
>characters knows that it can have either four strokes or three strokes.
>It's just a matter of taste of font designers, individuals, etc.  However,
>four stroke version is more common and I think is considered 'canonical'.
>Personally, I always (well, like most other Koreans, I rarely use Chinese
>characters with a pen/pencil these days) use four strokes.
>(ref. http://211.46.71.249/handic/index.htm)
>
>   I've just found that there are three grass radicals
>encoded in CJK Radicals supplement block: U+2EBE, U+2EBF, U+2EC0
>in addition to the _full_ form at U+2F8B.
>
>
> >   Another example is the ideograph for "one," which is different
> > in
> >   Chinese, Japanese, and Korean.
>
>   Did you mean U+58F9 ?
>
>
> > I have been told that neither of these are valid examples, for various
> > reasons.
>
>   I guess so.
>
>
> > I very much want to include a legitimate example of a character which
> > displays using different glyphs in various character sets, and am hoping
> > that one of you brilliant people out there can send me one ASAP, so I
> > can finish this blasted paper and go home to grab a glass of eggnog.
>
>   I'm sorry I can go to a great length to show why CJK unification is not
>so much a problem as some people have tried to make it look
>(e.g. ,
>  or  
>), it's rather hard to give an example you need.
>
>   One potential candidate is U+76F4 (straight). I can't say that the glyph
>given for U+76F4 in TUS 3.0 code chart (on p.677) is familiar
>to me (although I have little problem recognizing it) . A few years
>ago a Japanese sent me an ASCII rendering (as in TUS 3.0 or 2.0) and
>challenged me whether I could recognize it. He cited it as an example of
>inappropriate unification of traditional Chinese character and Japanese
>Kanji. However, this may not be an valid example, either. If you look it
>up at http://140.111.1.40 (Chinese character variant dictionary compiled
>under the auspice of MoE of Taiwan), the canonical glyph listed for the
>character is the same as the glyph on p. 886 of TUS 3.0 (the radical of
>the character meaning 'straight' is the character meaning 'eye' U+76EE
>with three additional strokes) which is certainly more familiar to me
>and, I believe, to Japanese.  Therefore, his assertion that traditional
>Chinese glyph for U+76F4 is different from Japanese counterpart does not
>hold up. As Thomas Chan once remarked on this list, Japanese objection to
>CJK unification would have been much smaller if more canonical
>glyphs (as listed in the aforementioned variant dictionary) have been
>used in TUS 3.0 table.
>
>   Jungshik Shin
>
>





Re: Clean and Unicode compliance

2001-12-16 Thread Martin Duerst

As the person who implemented UTF-8 checking for http://validator.w3.org,
I beg to disagree. In order to validate correctly, the validator has
to make sure it correctly interprets the incomming byte sequence as
a sequence of characters. For this, it has to know the character
encoding. As an example, there are many files in iso-2022-jp or
shift_jis that are prefectly valid as such, but will get rejected
by some tools because they contain bytes that correspond to '<' in
ASCII as part of a doublebyte character.

So the UTF-8 check is just to make sure we validate something
reasonable, and to avoid GIGO (garbage in, garbage out).
Of course, this cannot be avoided completely; the validator
has no way to check whether something that is sent in as
iso-8859-1 would actually be iso-8859-2. (humans can check
by looking at the source).
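This is not the validator's actual code, but a minimal sketch of such a
check in Perl, assuming a Perl with the Encode module available:

   use Encode qw(decode);

   # returns true if the byte string is well-formed UTF-8
   sub is_valid_utf8 {
       my ($bytes) = @_;
       eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
       return $@ ? 0 : 1;
   }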

Regards,  Martin.

At 12:26 01/12/14 -0800, James Kass wrote:
>There is so much text on the web using many different
>encoding methods.  Big-5, Shift-JIS, and similar encodings
>are fairly well standardised and supported.  Now, in addition
>to UTF-8, a web page might be in UTF-16 or perhaps even
>UTF-32, eventually.  Plus, there's a plethora of non-standard
>encodings in common use today.  An HTML validator should
>validate the mark-up, assuring an author that (s)he hasn't
>done anything incredibly dumb like having two 
>tags appearing consecutively.  Really, this is all that we should
>expect from an HTML validator.  Extra features such as
>checking for invalid UTF-8 sequences would probably be most
>welcome, but there are other tools for doing this which an
>author should already be using.
>
>Best regards,
>
>James Kass.
>





Re: Clean and Unicode compliance

2001-12-16 Thread Martin Duerst

At 07:16 01/12/14 -0800, James Kass wrote:
>Having an HTML validator, like Tidy.exe, which generates errors
>or warnings every time it encounters a UTF-8 sequence is
>unnerving.  It's especially irritating when the validator
>automatically converts each string making a single UTF-8
>character into two or three HTML named entities.

This is really bad. Have you made sure you have the right
options? Tidy has a lot of options.


Regards,   Martin.




Re: HTML Validation (was Re: Clean and Unicode compliance)

2001-12-16 Thread Martin Duerst

Hello James (and everybody else),

Can you please send comments and bug reports on the validator to
[EMAIL PROTECTED]? Sending bug reports to the right address
seriously increases the chance that they get fixed.

Regards,  Martin.

At 14:46 01/12/16 -0800, James Kass wrote:

>Elliotte Rusty Harold wrote,
>
> >
> > I suspect a lot of our tools haven't been thoroughly tested with
> > PLane-1 and are likely to have these sorts of bugs in them.
>
>Since Plane One is still fairly new, this is understandable.
>
>I'm also having trouble getting Plane Zero pages to validate.
>
>Spent several hours revising some of my pages as a result of
>some kindly off-list suggestions.  (Most of the pages on my site
>were rewritten to pass Tidy.exe long ago, and apparently were
>already correct.)  After getting the revised pages to pass the
>Tidy validator (which is also from w3), it was a big surprise
>that the first four pages checked with the W3 validator failed
>to pass.
>
>Amazingly, some pages didn't pass because &quot; wasn't recognized
>as a valid named entity.
>
>After tidy warns that 

W3C Internationalization Workshop

2001-12-10 Thread Martin Duerst

Dear Unicoders,

W3C is holding a workshop on Internationalization to evaluate
the work over the last years and decide on new directions
(in particular guidelines and outreach). Details are
as follows:

Date: 1 February 2002
Location: Omni Shoreham Hotel, Washington DC, USA

The Call for Participation is on the Web at:

http://www.w3.org/2002/02/01-i18n-workshop/cfp.html

The full Call for Participation contains information about registration
requirements and procedures, and a link to the online registration form.
The deadlines for this workshop are:

 Position statements due: 10 January 2002
 Registration closes: 10 January 2002

This event is co-located with the 20th International Unicode Conference.
Please note: while this workshop is open, there is an attendance limit
of 45; preference will be given based on

1. quality of position statements and
2. W3C Membership.


Looking forward to your participation, Martin.





Comments on draft-masinter-url-i18n-08.txt, please

2001-12-09 Thread Martin Duerst

Dear Unicoders,

http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-08.txt
about the internationalization of URIs (called IRIs) has recently
been updated and published.

This has been around for a long time, but we plan to move ahead with it
in the very near future. Please have a look at the document,
and send me any comments that you have soon.

Many thanks in advance, Martin.


A New Internet-Draft is available from the on-line Internet-Drafts directories.


Title   : Internationalized Resource Identifiers (IRI)
Author(s)   : L. Masinter, M. Duerst
Filename: draft-masinter-url-i18n-08.txt
Pages   : 12
Date: 28-Nov-01

This document defines a new protocol element, an Internationalized
Resource Identifier (IRI). An IRI is a sequence of characters from
the Universal Character Set [10646]. A mapping from IRIs to URIs
[RFC 2396] is defined, which means that IRIs can be used instead
of URIs where appropriate to identify resources

A URL for this Internet-Draft is:
http://www.ietf.org/internet-drafts/draft-masinter-url-i18n-08.txt




Re: Unicode aware drawing program

2001-12-06 Thread Martin Duerst

I suggest you look at tools that in one way or another produce
SVG. SVG is based on XML and therefore supports Unicode.

Please see http://www.w3.org/Graphics/SVG/ and
http://www.w3.org/Graphics/SVG/SVG-Implementations.htm8#svgedit
and below.

Please note that not all tools may support the same range of
Unicode characters, but because SVG is XML, you can also
separate graphical drawing (in a drawing tool) and text editing
(in an Unicode-capable plain-text editor).
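For illustration, a minimal hand-written SVG file with non-Latin text
(an untested sketch; any Unicode-capable text editor can produce this):

   <?xml version="1.0" encoding="UTF-8"?>
   <svg xmlns="http://www.w3.org/2000/svg" width="220" height="120">
     <circle cx="60" cy="60" r="40" fill="none" stroke="black"/>
     <text x="110" y="65" font-size="18">Ελληνικά 日本語</text>
   </svg>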

Regards,  Martin.


At 11:27 01/12/06 -0500, Elliotte Rusty Harold wrote:
>I need a vector based illustration program such as Visio or Adobe 
>Illustrator that can handle significant numbers of non-European, 
>non-Latin-1 characters. Otherwise, my needs are very simple (Just circles, 
>squares, and text really) I've been using Visio 5, but it won't go beyond 
>Latin-1; and as near as I can tell from Microsoft's web site Visio 2002 is 
>no better in this regard. I only need to draw one figure, so I'd rather 
>not spend a fortune. I could use it on either NT, MacOS 9, or Linux. Can 
>anyone suggest a drawing program that might suit my needs?
>
>As a last resort, I suppose I could write a Java program to auto-generate 
>an SVG image, but that feels a little like using a hearse to haul dirt 
>around my farm. It would probably work, but isn't really the right tool 
>for the job. :-)
>
>--
>+---++---+
>| Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
>+---++---+ 
>|   Java I/O (O'Reilly & Associates, 1999)   |
>|http://www.ibiblio.org/javafaq/books/javaio/|
>|   http://www.amazon.com/exec/obidos/ISBN=1565924851/cafeaulaitA/   |
>+--+-+
>|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/  | 
>|  Read Cafe con Leche for XML News: http://www.ibiblio.org/xml/ |
>+--+-+
>
>





RE: Planning a "Unicode Only" Week

2001-11-29 Thread Martin Duerst

It's very much working that way in any serious browser.
Some font formats (e.g. bitmaps for XWindows on Unix)
use layouts corresponding to traditional encodings.
Truetype fonts used on many systems can be directly
accessed by Unicode, but part of the info in a conversion
table is still needed to know what characters (actually
glyphs!) a font covers.

Regards,Martin.

At 10:37 01/11/29 -0500, Suzanne M. Topping wrote:


> > -Original Message-
> > From: [EMAIL PROTECTED]
>
> > I think maybe that encoding (on the Internet) does not much
> > matter. As long as my browser knows that it is looking at
> > Unicode, it knows which, say, SJIS, character to look up in
> > the font to display. Must have table lookup or something.
>
>Now THAT's a Browser!!! What the heck are you using? Apologies for my
>incredulity, but I'd be surprized if the browser has this level of
>sophistication... I question whether it's working the way you describe,
>but I could of course be wrong.





Re: Hangul script type: (was Re: [OT] ANN: Site about scripts)

2001-10-16 Thread Martin Duerst

At 16:44 01/10/11 -0700, Kenneth Whistler wrote:

>Hangul is structured from an alphabet (the jamo). That alphabet is
>so tightly coupled to the phonology of Korean that it can be
>considered a phonemic alphabet -- it is very regularly related to
>the sound of Korean.

Oh well. I guess you never heard about all the liaisons
and things that go on with final consonants and consonant
clusters. Hangul was custom-made for Korean, but Korean has
changed. In many ways, it's similar to the current state
of French. It's definitely not as nice as Finnish or Italian,
but not as bad as English.

Regards,   Martin.




Re: japanese xml

2001-08-30 Thread Martin Duerst

Hello David,

What you say is true, but it affects only a very small set of
codepoints, mainly symbols. For more documentation, I recommend
to read http://www.w3.org/TR/japanese-xml/.

Regards,   Martin.

At 13:13 01/08/30 -0500, David Starner wrote:
>On Thu, Aug 30, 2001 at 09:51:24AM -0700, Addison Phillips [wM] wrote:
> > And it is worth mentioning, because, in fact,
> > EUC-JP (and many other encodings) are perfectly interoperable for the
> > subset of characters that they represent.
>
>One of the big complaints I hear in trying to Unicodize Linux is that
>that EUC-JP,  Shift-JIS, and CP932 are all encodings that include
>JIS X 0208; but they all map the JIS X 0208 portion differently.
>So EUC-JP <-> Shift-JIS produces different results than EUC-JP <->
>Unicode <-> Shift-JIS. That's not perfectly interoperable.
>
>--
>David Starner - [EMAIL PROTECTED]
>Pointless website: http://dvdeug.dhis.org
>"I don't care if Bill personally has my name and reads my email and
>laughs at me. In fact, I'd be rather honored." - Joseph_Greg





RE: japanese xml

2001-08-30 Thread Martin Duerst

At 10:39 01/08/30 +0100, [EMAIL PROTECTED] wrote:

>Additionally, if you are thinking of XML (or
>HTML) then you can encode *all* Unicode characters in an EUC-encoded
>document, by employing numeric character references for characters
>outside the EUC character repertoire.  Using the same technique, you can
>encode all Unicode characters in an ASCII-encoded document.

One small clarification: Numeric character references can only be
used in content (including attribute values), but not in element/
attribute names,... So if you have a Japanese document with
ASCII-based markup (always true for HTML), or with Japanese markup
(what the question was about), euc-jp will work. However, if you
have Arabic element names, Devanagari attribute names, processing
instructions using Hangul, XML comments containing Mongolian,
or anything similar, you have to keep the document in some
Unicode-based encoding and cannot use euc-jp. Not that such things
are likely, but better be sure.
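A small illustration (hypothetical element names): this document can stay
in euc-jp because the non-euc-jp character appears only in content, as an
NCR:

   <?xml version="1.0" encoding="euc-jp"?>
   <memo>
     <!-- an Arabic letter (U+0627) in content: fine as an NCR -->
     <line>&#x0627;</line>
     <!-- an element *named* with that letter would have to be written
          literally, which euc-jp cannot encode -->
   </memo>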

And to Marco: It's great to hear that you think that the existence
of numeric character references in XML and HTML, and the fact that
they are based on Unicode, is common knowledge. For somebody like
Misha and me who have worked on getting us there, it may take some
more time to be convinced about that.

Regards,   Martin.




Re: japanese xml

2001-08-29 Thread Martin Duerst

There are lots of examples out there, but mostly in legacy encodings.
If you need one in an UTF, just convert it yourself (and make sure
you change or remove 'encoding="euc-jp"').
XML mandates that every processor (the receiving end) understands
UTF-8 and UTF-16, but documents can be in other encodings.

Regards,   Martin.

At 18:13 01/08/29 +1000, Viranga Ratnaike wrote:
>Hi,
>
> I was hunting for examples of japanese xml and came across the
> following, which looks rather cool.  Except that it doesn't seem
> to actually be unicode.  I thought XML had mandated unicode?
>
> http://java.sun.com/xml/jaxp-1.1/examples/samples/weekly-euc-jp.xml
>
> Are there other documents like this, but with the underlying
> encoding being one of the UTFs ?  I'd love to test my software on
> data which has japanese in the element names as well as the PCDATA.
>
>Regards,
>
> Viranga





Re: exchanging Arabic data in utf-8

2001-08-22 Thread Martin Duerst

IANA charset names have always been case-insensitive, and it
would be very strange if suddenly Microsoft (or anybody else)
made them case-sensitive.

I suggest you make sure that the header written out by the
CGI script says

Content-Type: text/html; charset=utf-8

and try again. If it doesn't work, put up an example page
on the web so that we can have a look at it.

Regards,   Martin.

At 14:19 01/08/22 -0700, Michael \(michka\) Kaplan wrote:
>From: "Iman Saad" <[EMAIL PROTECTED]>
> > I tried adding the following header in the section
> > of the cgi script that includes the html code, but that did not change
> > anything:
> >
> > <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
>
>If you look at the following link (all on one line)
>
>http://msdn.microsoft.com/workshop/author/dhtml/reference/charsets/charset4.asp
>
>it quite specifically wants the IANA name for UTF-8, which would be utf-8
>(lowercase) rather than the uppercase you specified. Try that and see if it
>helps
>
>
>MichKa
>
>Michael Kaplan
>Trigeminal Software, Inc.
>http://www.trigeminal.com/
>
>





Re: Annotation characters

2001-07-23 Thread Martin Duerst

At 01:44 01/07/21 -0400, [EMAIL PROTECTED] wrote:
>In a message dated 2001-07-20 6:19:24 Pacific Daylight Time, [EMAIL PROTECTED]
>writes:
>
> >  You can find a better way to do furigana, and an answer to many
> >  of your questions, at http://www.w3.org/TR/ruby (the Ruby Annotation
> >  Recommendation).
>
>Patrick's original question concerned an undocumented, but arguably legal,
>way of using the Unicode interlinear annotation characters.
>
>Martin's response makes it sound as though the annotation characters have the
>Plane 14 nature: they were brought into this world with strong warnings never
>to use them, but instead to use an equivalent mechanism in HTML, XML, or some
>other higher-level protocol.

As far as I understand the history of these characters and their
current use, I think this is true. The following may give more information:
http://www.w3.org/TR/unicode-xml/#Interlinear


>TUS 3.0 says (p. 326) that the use of annotation characters is "strongly
>discouraged without prior agreement between the sender and the receiver."  Is
>this as strong a statement as the one in Unicode 3.1 concerning language
>tags, which states that they are not to be used at all except in the presence
>of specific protocols?

The language here is slightly different, and I have no idea whether
the intent was exactly the same, but in any case it seems that the
intents were very close to each other.


Regards,  Martin.




Re: Is there Unicode mail out there?

2001-07-22 Thread Martin Duerst

Sorry - By 'pattern restrictions on mixed content' I meant a
feature in XML Schema that would allow one to specify that the
mixed content of certain elements is restricted by a pattern
facet. This is a feature that isn't in XML Schema, but that
has been discussed. It would make it possible to state that a
document does not allow C0 control characters, something that
would be very important in many cases if the basic XML syntax
started to allow C0.
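
The kind of restriction such a facet would express can be sketched in
Python, just to show the character class involved (TAB, LF and CR are the
C0 controls XML 1.0 already allows):

    import re

    C0 = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F]')

    print(bool(C0.search('plain text')))      # False
    print(bool(C0.search('bell\x07here')))    # True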

Regards,   Martin.

At 10:32 01/07/19 -0600, Shigemichi Yazawa wrote:
>At Thu, 19 Jul 2001 15:52:39 +0900,
>Martin Duerst <[EMAIL PROTECTED]> wrote:
> > Of course then pattern restrictions on mixed content (which we
> > currently don't have) would become really helpful.
>
>Martin,
>
>What kind of pattern restrictions are necessary by introducing C0 NCR?
>Something like this?
>
>---
>Shigemichi Yazawa
>[EMAIL PROTECTED]





Re: Annotation characters

2001-07-20 Thread Martin Duerst

Hello Patrick,

You can find a better way to do furigana, and an answer to many
of your questions, at http://www.w3.org/TR/ruby (the Ruby Annotation
Recommendation).

Regards,   Martin.

At 18:40 01/07/19 -0400, Patrick Andries wrote:

>Just a small question about annotation characters.
>
>If I understand p. 326 this sequence should be valid :
>
>U+723B   cat .
>
>Is this the case ?
>
>If so, does such an annotation character sequence have any application
>in Japanese typography ? In other words, does one find double ruby
>notation ? I would find it useful for kanji ignorami like me.
>
>If double furigana are sometimes used, are both annotations found
>stacked on top of one another or is one annotation found on top and the
>other at the bottom of the annotated character ?
>
>P. Andries
>
>P.-S. : I'm still interested by a definition of "in(-)line software"
>(http://www.unicode.org/unicode/reports/tr27/)". I know what inline code
>or  processing could be but I can't quite understand the relationship
>with the inline software mentioned here and processing music text.
>
>
>
>





Re: Is there Unicode mail out there?

2001-07-19 Thread Martin Duerst

I think that the right solution, if we could redo things, would
be to allow something like &#7; in content, but to never use
the actual byte values. This would allow the data guys to stream
stuff, and could leave the document guys reasonably unconcerned.
Of course then pattern restrictions on mixed content (which we
currently don't have) would become really helpful.

Regards,   Martin.

At 08:07 01/07/18 -0700, Mark Davis wrote:
> > I wouldn't want any control codes in a database. Having a control-G
> > may be funny (the joke as I know it goes back to Don Knuth), but
> > something like a control-S is too much of a risk.
>
>*You* wouldn't want?
>
>There are a lot of characters *I* wish were not in databases, or in use at
>all. A lot of them may or may not make sense. Whether or not I want them,
>someone can have a database where they are allowed. By having this
>(inconsistent) restriction, it simply means I can't be guaranteed full
>round-tripping  from databases to XML and back, no matter what their
>content.
>
>Of course, this is not a huge restriction -- it is simply a gratuitous
>annoyance. One could even live with something much more onerous, say XML
>disallowing all characters whose code points were divisible by 4321 -- just
>have complicated DTDs and shift into base64 if you encounter any of those
>codes.





Re: Is there Unicode mail out there?

2001-07-17 Thread Martin Duerst

At 14:30 01/07/17 -0700, Mark Davis wrote:
> > In that case the content of the field is not text but an octet string,
> > and you need to do something different, like base64-ing it.
>
>The content in the database is not an octet string: it is a text field that
>happens to have a control code -- a legitimate character code -- in it.
>Practically every database allows control codes in text fields. (And why are
>C1 controls allowed? After all, they are even less frequent than C0
>controls.)

Mark - I understand your dissatisfaction. But the C1 controls are not
allowed in HTML4, and according to James Clark, the fact that they are
allowed in XML was an oversight.

Databases can (and should) take care of their data. There are very
few cases where having control characters in there makes sense.
In most cases, however, they are errors, and if XML gives an
incentive to fix them, all the better.

I wouldn't want any control codes in a database. Having a control-G
may be funny (the joke as I know it goes back to Don Knuth), but
something like a control-S is too much of a risk.


Regards,   Martin.




Re: Unicode, UTF-8 and Extended 8-Bit Ascii - Help Needed

2001-07-10 Thread Martin Duerst

At 11:52 01/07/10 +0100, Stephen Cowe - Sun Scotland wrote:
>Hi Unicoders,
>
>I am new to the list and would be really grateful if you could help me out 
>here.
>
>I am trying to discover if the "extended latin" 8-bit ascii (decimal
>values 128-255, Hex A0-FF), i.e. ISO-8859-1 are supported by UTF-8,

Yes.

>and
>if so, are the values the same.

No.
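
A quick Python illustration of both answers: the repertoire is covered,
but the byte values differ.

    print('\u00e9'.encode('iso-8859-1').hex())   # e9
    print('\u00e9'.encode('utf-8').hex())        # c3a9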

Regards,   Martin.



>The reason why I am asking this is because our EDIFACT EDI system
>requires to send extended latin European characters (using the UNOC version 3
>syntax identifier) and our global internal messaging system is being 
>converted
>to UTF-8.
>
>I have had a good search of the Unicode web-site but do not seem to be able to
>find the answer, yes or no, that I require.
>
>I look forward to hearing from you, kind regards,
>
>Stephen Cowe.
>
>eCommerce Technologist
>GSO IT EDI/EDE
>+44 (0)1506 672541 (Tel)
>+44 (0)1506 672893 (Fax)
>[EMAIL PROTECTED]
>





Innovative use of Latin ?!

2001-07-02 Thread Martin Duerst

For people interested in new scripts, and new uses
of existing scripts :-)
http://www.google.com/intl/xx-hacker/

Regards,   Martin.





Re: XML Blueberry Requirements

2001-06-22 Thread Martin Duerst

Hello Elliotte,

Just two points:
- If you are suggesting that discussion move to xml-dev, can you
   please give the full address of that mailing list?
- I suggest you/we don't cross-post to [EMAIL PROTECTED], because
   it's not an issue the Unicode consortium has to decide.
   (I'm just cross-posting them once more in case anybody there
disagrees.)

Regards,   Martin.

At 09:37 01/06/21 -0400, Elliotte Rusty Harold wrote:
>This is going out to three mailing lists. I'd like to add a fourth and 
>suggest that future discussion take place on xml-dev, which probably has 
>the broadest reach of interested parties.





Re: UTF-8 signature in web and email

2001-05-22 Thread Martin Duerst

At 00:07 01/05/23 +0100, Juliusz Chroboczek wrote:
>MS-DOS users, on the other hand, expect applications to have pro-
>prietary formats, and are quite happy to go through convoluted con-
>version procedures in order to access their data (to the extent to
>which they are happy in the first place).  Heck, MS-DOS doesn't even
>have the concept of concatenating plain files!

One addition:

MS-DOS and Windows also have file types based on extensions.
For the typical case where the UTF-8 'signature' was promoted,
namely a Win system with a single legacy encoding and UTF-8,
introducing a new extension for UTF-8 files (e.g. .t8t instead
of .txt) would have had the same benefits as the 'signature',
and many fewer problems. And it can still be done!


Regards,Martin.




Re: UTF-8 signature in web and email

2001-05-18 Thread Martin Duerst
At 22:58 01/05/17 -0400, [EMAIL PROTECTED] wrote:
>Martin Dürst wrote:
>
> > There is about 5% of a justification
> > for having a 'signature' on a plain-text, standalone file (the reason
> > being that it's somewhat easier to detect that the file is UTF-8 from the
> > signature than to read through the file and check the byte patterns
> > (which is an extremely good method to distinguish UTF-8 from everything
> > else)).
>
>A plain-text file is more in need of such a signature than any other type of
>file.  It is true that "fancy" text such as HTML or XML, which already has a
>mechanism to indicate the character encoding, doesn't need a signature, but
>this is not necessarily true of plain-text files, which will continue to
>exist for a long time to come.
>
>The strategy of checking byte patterns to detect UTF-8 is usually accurate,
>but may require that the entire file be checked instead of just the first
>three bytes.  In his September 1997 presentation in San Jose, Martin conceded
>that "Because probability to detect UTF-8 [without a signature] is high, but
>not 100%, this is a heuristic method" and then spent several pages evaluating
>and refining the heuristics.  Using a signature is not somewhat easier, it is
>*much* easier.

Sorry, but I think your summary here is a bit slanted.
I indeed used several pages, but the main aim was to show that
in practice, it's virtually 100%, for many different cases.
People using this heuristic, who didn't really think it would
work that well after the talk, have confirmed later that it
actually works extremely well (and they were writing production
code, not just testing stuff). On the other hand, I never met
anybody who showed me an example where it actually didn't work.
I would be interested to know about one if it exists.

I just said 'high, but not exactly 100%', because it was a technical
talk and not a marketing talk. Could be that this wasn't
easy to understand for some of the audience? There is no actual
need in practice to refine the heuristics.

The use of the signature may be easier than the heuristic in particular
if you want to know before reading a file what the encoding of the
file is. But in most cases, you will want to convert it somehow,
and in that case, it's easy to just read in bytes, and decide
lazily (i.e. when seeing the first few high-octet bytes) whether
to transcode the rest of the file e.g. as Latin-1 or as UTF-8.
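
A minimal sketch of that kind of check in Python (the Latin-1 fallback is
only an assumption about the single legacy encoding):

    def sniff_encoding(data: bytes) -> str:
        # Bytes that validate as UTF-8 almost certainly are UTF-8.
        try:
            data.decode('utf-8')
            return 'utf-8'
        except UnicodeDecodeError:
            return 'iso-8859-1'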

Also, the signature really only helps if you are only dealing with
two different encodings, a single legacy encoding and UTF-8.
The signature won't help e.g. to keep apart Shift_JIS, EUC, and
JIS (and UTF-8), but the heuristics used for these cases can
easily be extended to UTF-8.


> > - When producing UTF-8 files/documents, *never* produce a 'signature'.
> >There are quite a few receivers that cannot deal with it, or that deal
> >with it by displaying something. And there are many other problems.
>
>If U+FEFF is not interpreted as a BOM or signature, then by process of
>elimination it should be interpreted as a zero-width no-break space (ZWNBSP;
>more on this later).  Any receiver that deals with a ZWNBSP by displaying a
>visible glyph is not very smart about they way it handles Unicode text, and
>should not be the deciding factor in how to encode it.

Don't think that display is everything that can be done to a text
file. An XML processor that doesn't expect a signature in UTF-8
will correctly reject the file if the signature comes before
an XML declaration. Same for many other formats and languages.


>What are the "many other problems"?  Does this comment refer to programs and
>protocols that require their own signatures as the first few bytes of an
>input file (like shell scripts)?  The Unicode Standard 3.0 explicitly states
>on page 325, "Systems that use the byte order mark must recognize that an
>initial U+FEFF signals the byte order; it is not part of the textual
>content."  Programs that go bonkers when handed a BOM need to be corrected to
>conform to the intent of the UTC.

This would mean changing all compilers, all other software dealing with
formatted data, and so on, and all unix utilities from 'cat' upwards.
In many cases, these applications and utilities are designed to work
without knowing what the encoding is; they work on a byte stream.
This makes it just impossible to conform to the above statement.
If you have an idea how that can be solved, please tell us.

The problem goes even further. How should the 'signature' be handled
in all the pieces of text data that may be passed around inside
an application, or between applications, but not as files?
Having to specify for each case who is responsible for adding or
removing the 'signature', and then doing the actual work, is just crazy.


Regards,   Martin.


Re: UTF-8 signature in web and email

2001-05-16 Thread Martin Duerst

Hello Roozbeh

At 04:02 01/05/15 +0430, Roozbeh Pournader wrote:

>Well, I received a UTF-8 email from Microsoft's Dr International today. It
>was a "multipart/alternative", with both the "text/plain" and "text/html"
>in UTF-8. Well, nothing interesting yet, but the interesting point was
>that the HTML version had a UTF-8 signature, but the text version lacked
>it. So, the HTML version had it three times: mime charset as UTF-8,
>UTF-8 signature, and <meta> charset markup.

This is definitely overblown. There is about 5% of a justification
for having a 'signature' on a plain-text, standalone file (the reason
being that it's somewhat easier to detect that the file is UTF-8 from the
signature than to read through the file and check the byte patterns
(which is an extremely good method to distinguish UTF-8 from everything
else)). For self-labeled data (HTML, XML, CSS) and in the context
of MIME (with the charset parameter), a UTF-8 signature doesn't
make sense at all.


>Questions:
>
>1. What are the current recommendations for these?

- When producing UTF-8 files/documents, *never* produce a 'signature'.
   There are quite a few receivers that cannot deal with it, or that deal
   with it by displaying something. And there are many other problems.

- When receiving UTF-8, you probably should check for a 'signature'
   and remove it. There are too many applications that send one out,
   unfortunately.
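
A receiving side could do something like this minimal Python sketch
(byte-level input assumed):

    def strip_utf8_signature(data: bytes) -> bytes:
        # Drop a leading EF BB BF if present; leave everything else alone.
        return data[3:] if data.startswith(b'\xef\xbb\xbf') else data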


>2. Most important of all, does W3C allow UTF-8 signatures before
>""? And if yes, what should be done if they mismatch the
>charset as can be described in the <meta> tag?

For text/html, neither the HTML spec nor the IETF definition of UTF-8
(RFC 2279) says anything as far as I know. The reason was that nobody
thought about a UTF-8 signature at that time.

For XML, the 'signature' is now listed in App F.1
http://www.w3.org/TR/REC-xml#sec-guessing-no-ext-info
But this is not normative, and fairly recent, and so you should never
expect an XML processor to accept it (except as a plain character
in the file when there is no XML declaration).


Regards,   Martin.






RE: Unicode in a URL

2001-04-26 Thread Martin Duerst

Hello Mike,

At 19:09 01/04/26 -0600, Mike Brown wrote:
> > W3C specifies to use %-encoded UTF-8 for URLs.
>
>I think that's an overstatement.
>Neither the W3C nor the IETF make such a specification.

True. Neither W3C nor IETF makes such a general statement,
because we can't just erase the roughly 10 years of history
of URIs.


>http://www.w3.org/TR/charmod/#sec-URIs
>contains many ambiguities, conflicts with XML and HTTP,
>and is not yet a recommendation.

Thanks for having had a look at that. It may of course
contain some ambiguities, and may need improvement in
presentation and wording (that's why we have put it
out for last call, and we are now working on the comments
we got). It's also true that it's not yet a recommendation.
But XML is a Recommendation, and XLink and XML Schema are
both Proposed Recommendations (i.e. close to Recommendations),
and they all say the same, in the places where they have to
say something about URIs.

And I'm aware of no conflicts with XML or HTTP at all.


>I wrote a little about this topic at
>http://skew.org/xml/misc/URI-i18n/

Overall, that's an extremely well written and well presented
document. But it contains a crucial misunderstanding.

It gives the following example (sorry for the "e'"; my Japanese
mailer doesn't handle Latin-1):

 
Here is a scenario that illustrates how the assumption of UTF-8
based escaping could conflict with the URI spec's deference to the
scheme specs:



<!DOCTYPE greeting [
  <!ENTITY greeting SYSTEM
    "http://somewhere/getgreeting?lang=es&name=C%C3%A9sar">
]>
&greeting;

The name Ce'sar is represented here as C%C3%A9sar in the UTF-8 based escaping,
as per the XML requirement.
 

This is wrong. There is no requirement to use UTF-8 for all the %hh escapes
in system literals. A system literal is a URI, and you can use whatever
%hh sequence you want. In particular, if you have to send the corresponding
Latin-1 bytes to "http://somewhere/getgreeting", then you can use
http://somewhere/getgreeting?lang=es&name=C%E9sar. URIs like this have
worked for a long time, and there is no reason nor intention from W3C
to stop you (if you really need that).

What the XML spec (and all the others mentioned above) say is something
different. Assume the following example:

 


<!DOCTYPE greeting [
  <!ENTITY greeting SYSTEM
    "http://somewhere/getgreeting?lang=es&name=César">
]>
&greeting;
  

Here there is an actual e-acute character in the file (I just used a numeric
character reference to make sure it gets through email). This can't be sent
off directly, and it's better if we clearly say what an XML processor
(which is the thing that interprets the XML, resolving the entities
and doing other parsing-related things) has to do.

It is this case to which the XML spec, the character model, and so on, apply.
In this case, the e-acute character is converted to %C3%A9, and the XML
processor tries to get the entity from
 http://somewhere/getgreeting?lang=es&name=C%C3%A9sar

So in short, what the XML spec is saying is:

- If you use non-ASCII characters directly in a system id, they're converted
   using UTF-8.
- If you want anything else, use exactly the %-escapes you want. You won't
   get the benefit of using the actual character in the source document.
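
The two cases can be illustrated with Python's URL escaping, just to show
the byte values involved:

    from urllib.parse import quote

    # The actual character in the source: the processor escapes it as UTF-8.
    print(quote('C\u00e9sar'))                           # C%C3%A9sar
    # Escapes written out by hand can carry whatever bytes you need:
    print(quote('C\u00e9sar', encoding='iso-8859-1'))    # C%E9sar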

I hope this clears things up a bit. If there is anything in the XML spec
or any other spec that you think should be changed to make this clearer,
we are always open to suggestions. But I think the fact that it says
"The XML processor must escape disallowed characters as follows:" makes
it quite clear that this happens on parsing, not when creating the XML
document.


Regards,   Martin.




RE: Unicode in a URL

2001-04-26 Thread Martin Duerst

At 15:02 01/04/26 -0700, Paul Deuter wrote:
>Based on the responses, I guess my original question/problem was not
>very well written.

>The %XX idea does not work because this is already in use by lots of
>software
>to encode many different character sets.  So again we need something that
>identifies
>it as UTF-8.

It's used with lots of different encodings. Adding one more (UTF-8)
won't make it much worse, in the first place.

Second, it turns out that UTF-8 is extremely easy to detect/check,
the easiest of all encodings. For details, see
http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf

Apart from that, the HTTP protocol says exactly what you can send,
and so you can't just invent something new (such as %u),
even though it might work 'sometimes'.


>I see this as somewhat analogous to the invention of the U+ notation
>in Unicode consortium writings?  They needed a completely unambiguous way
>to tell their readers that the 16 bit value was not "any" 16 bit value
>but rather a specific Unicode codepoint.  They invented a new kind of escape
>sequence that said two things: what follows is hex *and* Unicode.
>
>I see the BOM as filling the same need for text files.  It was not enough
>to invent Unicode but also a way to identify the encoding.

The BOM for UTF-8 is doing a lot of damage. All the tools that
would work very nicely without the BOM stop working.


Regards,Martin.




Re: Unicode in a URL

2001-04-26 Thread Martin Duerst

Hello Paul,

At 19:41 01/04/25 -0700, Paul Deuter wrote:
>I am struggling to figure out the correct method for encoding Unicode
>characters in the
>query string portion of a URL.
>
>There is a W3C spec that says the Unicode character should be converted to
>UTF-8 and
>then each byte should be encoded as %XX.

It also says that form data should be encoded in the encoding of
the page where you fill in the form.


> From my experience however,
>browsers will
>encode all character sets this way and IIS at least will interpret such hex
>bytes according
>to the character set that is set on the receiving page.

Well, communication takes two ends. Each server, for each
URI, has to decide what to do. If you want to make use of
the UTF-8 convention, you have to set your server side
accordingly.


>With IIS 5.0, I have stumbled onto the solution of using %uXXXX where XXXX
>is the
>hexadecimal value of the Unicode character.  When I pass Unicode data
>formatted this way on
>Windows 2000/IIS5 - the data always seems to be decoded properly.
>(Apparently this
>format came from ECMAScript.)

This was a one-time ECMAScript solution. The ECMAScript standard now
has functions to support the UTF-8 convention.

The reason the %u was discontinued was that it's outside the
URI syntax, and therefore can break all kinds of things.


Regards,Martin.




Re: Unicode in a URL

2001-04-26 Thread Martin Duerst

At 11:28 01/04/26 -0700, Markus Scherer wrote:
>Paul Deuter wrote:
> > I am wondering if there isn't a need for the Unicode Spec to also
> > dictate a way of encoding Unicode in an ASCII stream.  Perhaps
>
>How many more ways do we need?
>
>To be 8-bit-friendly, we have UTF-8.
>To get everything into ASCII characters, we have UTF-7.
>W3C specifies to use %-encoded UTF-8 for URLs.

Unfortunately, there is more.

HTML/XML use &#d; (d is a decimal number) or
&#xh; (h is hexadecimal). Java has \uhhhh; CSS has \h.
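
For instance, generating the two XML/HTML forms for one code point
(a trivial Python sketch):

    cp = 0x4E00
    print('&#{};'.format(cp))      # &#19968;  (decimal)
    print('&#x{:X};'.format(cp))   # &#x4E00;  (hexadecimal)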

It would be very nice if there were only one convention everywhere,
but the circumstances (and their history) make that very difficult.

Another issue is that if you start combining things (e.g.
producing XML from Java or Perl, ...), it can be much less
confusing if each language has different conventions.

Regards,   Martin.




RE: How will software source code represent 21 bit unicode characters?

2001-04-17 Thread Martin Duerst

At 09:29 01/04/17 -0500, [EMAIL PROTECTED] wrote:
> > In a perfect world, we would probably have an enclosing symbol (e.g.
> > '\<4E00>') so that the number can be variable length.
>
>
>In Perl the notation is \x{...}, where ... is hexdigit sequence:
>\x{41} is LATIN CAPITAL LETTER A while \x{263a} is WHITE SMILING FACE,
>and \x{1D400} is MATHEMATICAL BOLD CAPITAL A.

And in XML it's &#x4E00;, which can also be variable length.
There is also &#d; where d is a decimal number, but
that's not that useful anymore now that standards use hexadecimal.

Regards,   Martin.




Re: Reviewing IETF documents

2001-04-16 Thread Martin Duerst

Hello Florian - There is no official or coordinated review of IETF
documents. Because of the volunteer nature of the IETF, it mostly
depends on individuals.

I have been in contact with the USEFOR group for a while.
What particular serious problem are you speaking about?

If you know about a problem, just tell the WG. In many cases,
it may take a few iterations to achieve common understanding.
If no solution is achieved, there are ways to escalate the
issue, but I hope we don't need to get there.

Regards,   Martin.

At 23:51 01/04/15 +0200, Florian Weimer wrote:
>"Tex Texin" <[EMAIL PROTECTED]> writes:
>
> > > Is someone constantly reviewing IETF documents (drafts and RFCs) which
> > > use UTF-8 or other Unicode-related technology?
>
> > Are there any specific documents that you are aware of that have
> > a problem or potential problem?
>
>So far, I've looked more closely at one RFC (2440, i.e. OpenPGP) and a
>draft (USEFOR), and in the case of USEFOR, there are a few minor
>problems, and an extremly severe, even dangerous one (it has to do
>with the use of Unicode in identifiers).  I'm still investigating the
>situation (see my other message).
>
>OpenPGP uses a concept of line endings which is only adequate for
>ASCII text, and Unicode in identifiers as well (with only slight
>interoperability issues such as database queries).  Usually, OpenPGP
>identifiers (i.e. user IDs) are not processed mechanically, so this
>is not a grave issue.





Re: Identifiers

2001-04-15 Thread Martin Duerst

Hello Florian,

Of course, KC/KD-normalization is not sufficient. The problem
already exists in ASCII. I/l/1 and 0/O can easily be confused.
It will always be necessary for people to think a bit when creating
their email addresses,...

On the other hand, when identifiers can be written in various
scripts, this will help avoid spelling and transcription errors
by people who are not familiar with the Latin script and the
various transliteration conventions.

Overall, there is a kind of 'natural selection'. The creators
of identifiers will find out one way or another what identifiers
work and what don't.

Of course, normalization (preferably NFC and/or NFKC, to stay in
line with the W3C and the IETF) can help quite a bit.
NFC only eliminates things that are supposed to look exactly
the same. NFKC eliminates quite a bit more than that.

Regards,   Martin.

At 20:10 01/04/15 +0200, Florian Weimer wrote:
>Unicode is finally entering domains which were ASCII-only for decades.
>However, with some kinds of identifiers, new problems occur.  Such
>identifiers are interpreted by humans and machines, and they have to
>survive printing and reentering.  Furthermore, it might not be
>possible to check identifiers online (in contrast to programming
>language identifiers).  Think of local-parts of email addresses for an
>example.
>
>Is it sufficient to mandate that all such identifiers MUST be KC- or
>KD-normalized?  Does this guarantee print-and-enter round-trip
>compatibility?





RE: Ruby Annotation and XHTML 1.1 are W3C Proposed Recommendations

2001-04-10 Thread Martin Duerst

At 10:00 01/04/09 -0700, Carl W. Brown wrote:
>I am wondering how, in the absence of a sub language, one should render
>Chinese ruby.  Mandarin ruby will not do a Cantonese reader much good.  Can
>I specify multiple ruby and then have one displayed depending on the spoken
>language?

Maybe that's one reason for ruby not being used as much in Chinese
as in Japanese? But it could be very interesting for somebody
speaking Cantonese and wanting to learn Mandarin, or vice versa.
You can give up to two ruby per base text, so you could have Mandarin
on one side and Cantonese on the other, or could switch one
or the other on with a stylesheet. For more advanced things, you would
need something like SMIL, which has an explicit switch statement.


>I have lamented the lack of a good IME interface to capture ruby as the text
>is entered.  If nothing else they can be useful for some types of sorting.

Yes, this is an interesting user requirement. But it's a matter of
editing tools, not something that can be done as part of e.g. the
(X)HTML specification.


>On a related subject:
>
>Has any further consideration been given to using style sheets to control
>the IME?  It is a pain to switch between fields that have different types of
>input.  For example, if I am entering a URL that does not support Han, then
>a name field that does, then a numeric field, it is extra work for the typist
>to change the IME settings.

We have had various discussions about this; it's very clear that there
is a need. However, there are various quite different ways to address it,
and we haven't yet found one that satisfies enough requirements and covers
enough cases to be able to agree on it. But for XForms (see
http://www.w3.org/MarkUp/Forms/), we definitely need something like this.

I propose to move further discussion to the [EMAIL PROTECTED]
mailing list.

Regards,   Martin.





Ruby Annotation and XHTML 1.1 are W3C Proposed Recommendations

2001-04-08 Thread Martin Duerst

Ruby Annotation (http://www.w3.org/TR/ruby) and
XHTML(TM) 1.1 - Module-based XHTML (http://www.w3.org/TR/xhtml11)
became W3C Proposed Recommendations on April 6, 2001.


Abstract of 'Ruby Annotation':

"Ruby" are short runs of text alongside the base text, typically used
in East Asian documents to indicate pronunciation or to provide a short
annotation. This specification defines markup for ruby, in the form of
an XHTML module.

More information about Ruby markup can be found at
http://www.w3.org/International/O-HTML-ruby.

XHTML 1.1 is a cleaned up version of XHTML 1.0 Strict, defined using
XHTML Modularization (http://www.w3.org/TR/xhtml-modularization/),
with the addition of ruby markup as defined in Ruby Annotation.


A Proposed Recommendation is believed to meet the relevant requirements,
to represent sufficient implementation experience, and to adequately address
comments from community reviews. Proposed Recommendations are reviewed by
the W3C Member Organizations.
(http://www.w3.org/Consortium/Process-20010208/tr.html#Recs)

Ruby Annotation has been produced as part of the W3C Internationalization
Activity (http://www.w3.org/International/Activity) by the
Internationalization Working Group, with the help of the Internationalization
Interest Group (I18N IG). Technical and editorial comments should be sent to
the publicly archived mailing list [EMAIL PROTECTED]


Regards,Martin.