Re: Internal Representation of Unicode

2003-09-25 Thread Doug Ewell
Johann  wrote:

> That does not have to be a problem, as long as there are no more than
> 255 accents and combinations of them.  As for Vietnamese, I just don't
> know how many there are, or how many characters they use.

You'll need UTF-8 and a fairly comprehensive font to read the following.

For Vietnamese, you should count on supporting the following vowels:

a à ả ã á ạ ă ằ ẳ ẵ ắ ặ â ầ ẩ ẫ ấ ậ e è ẻ ẽ é ẹ 
ê ề ể ễ ế ệ i ì ỉ ĩ í ị
o ò ỏ õ ó ọ ô ồ ổ ỗ ố ộ ơ ờ ở ỡ ớ ợ u ù ủ ũ ú ụ ư 
ừ ử ữ ứ ự y ỳ ỷ ỹ ý ỵ

the following consonant (in addition to most other English consonants):

đ

and this currency sign:

₫

For purposes of your mechanism, you can think of each vowel as having up
to 2 accents: (upper, right-attached, or none) plus (upper, lower, or
none).  The way Vietnamese think of it is that the circumflex, breve,
and horn are part of the base letter (making a total of 12 base vowels),
whereas the grave, hook above, tilde, acute, and dot below are
considered diacritics (6 × 12 = 72 total vowels).  All combinations are
possible.
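That decomposition is easy to check mechanically with Unicode normalization (a Python sketch; the sample vowels are my own choice, not from the message above):

```python
import unicodedata

# NFD splits a precomposed Vietnamese vowel into base letter plus
# combining marks; marks below (ccc 220) sort before marks above (ccc 230).
for vowel in ["ậ", "ắ", "ờ", "ệ"]:
    decomposed = unicodedata.normalize("NFD", vowel)
    print(vowel, "->", " + ".join("U+%04X" % ord(c) for c in decomposed))
```

For example "ậ" comes apart as U+0061 + U+0323 (dot below) + U+0302 (circumflex), and NFC puts it back together.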

Of course, all of the letters (not the dong sign) come in both uppercase
and lowercase.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Unicode 4.0 book (was: Re: About that alphabetician...)

2003-09-25 Thread Doug Ewell
That alphabetician  wrote:

> And on this very day, my copy of Unicode 4.0 has arrived. :-)

Shipping takes longer to IE than to US.  (Oops, I just used ISO's
intellectual property.)

I received my copy a few weeks ago, and just noticed the new section
5.19, "Unicode Security" (pp. 140-142), which includes, to my surprise
and glee, a subsection on "Spoofing" that borrows heavily from my e-mail
of 2002-02-15:

http://www.unicode.org/mail-arch/unicode-ml/y2002-m02/0304.html

right down to the word "albeit."  :-)

I am delighted that the book committee found these suggestions useful.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Internal Representation of Unicode

2003-09-25 Thread myrkraverk
Hi,

John Cowan writes:
 > The problem is that multiple accents above are quite common -- Vietnamese
 > depends on them heavily.  There may also be multiple accents below,
 > for all I know.

That does not have to be a problem, as long as there are no more than
255 accents and combinations of them.  As for Vietnamese, I just don't
know how many there are, or how many characters they use.


Johann

-- 
Emacs is not a text editor -- it's a way of life




Re: Internal Representation of Unicode

2003-09-25 Thread John Cowan
[EMAIL PROTECTED] scripsit:

> All of these fields are actually implementation defined, with just one
> rule for char: don't include characters that can be made with
> combinations, that's what the accent fields are for.  This allows for
> 255 upper and lower accents which should be enough -- for now.

The problem is that multiple accents above are quite common -- Vietnamese
depends on them heavily.  There may also be multiple accents below,
for all I know.

-- 
John Cowan  http://www.ccil.org/~cowan  [EMAIL PROTECTED]
Be yourself.  Especially do not feign a working knowledge of RDF where
no such knowledge exists.  Neither be cynical about RELAX NG; for in
the face of all aridity and disenchantment in the world of markup,
James Clark is as perennial as the grass.  --DeXiderata, Sean McGrath



Re: About that alphabetician...

2003-09-25 Thread Curtis Clark
Of course, any Unicode character can be expressed as an XML character 
reference (e.g. &#2350; for म) in any web page encoding, even US-ASCII.
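For instance (a quick Python sketch; `html.unescape` performs the reverse mapping):

```python
import html

# U+092E DEVANAGARI LETTER MA as decimal and hex character references:
# both forms are pure ASCII, so they survive any page encoding.
ma = "\u092e"
print("&#%d;" % ord(ma))    # decimal reference
print("&#x%X;" % ord(ma))   # hex reference

# Round trip: the references resolve back to the same character.
assert html.unescape("&#2350;") == ma
assert html.unescape("&#x92E;") == ma
```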
--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Mockingbird Font Works  http://www.mockfont.com/




Re: Unicode Normalisation Optimisation Experiments

2003-09-25 Thread Markus Scherer
Peter Kirk wrote:
On 25/09/2003 14:25, Markus Scherer wrote:
In other words, yes, Unicode's NFC does perform "discontiguous 
composition". Some things might be easier if only contiguous 
composition were used, but the current definition does give you the 
shortest strings.
And this current definition cannot be changed because of the stability 
policy, right?
Right.

markus




Internal Representation of Unicode

2003-09-25 Thread myrkraverk
Hi,

In a plain text environment, there is often a need to encode more than
just the plain character.  A console, or terminal emulator, is such an
environment.  Therefore I propose the following as a technical report
for internal encoding of Unicode characters, with one goal in mind:
character equivalence is binary equivalence.

Since I'm using 64 bits, I call it Excessive Memory Usage Encoding, or
EMUE.

I thought of dividing the 64-bit code space into 32 variably wide
planes: one for control characters, one for Latin characters, one for
Han characters, and so on; using 5 bits, with the next 3 fixed to zero
(for future expansion and alignment to an octet).

I call plane 0 control characters and won't discuss it further.

Plane 1 I had intended for Latin characters, with the following
encoding method in mind:

bits  63..59  58..56  55..40  39..32  31..24  23..16  15..8  7..0
     +-------+-------+-------+-------+-------+-------+------+------+
     | plane | zero  | attr  | res   | uacc  | lacc  | res  | char |
     +-------+-------+-------+-------+-------+-------+------+------+

* Plane  Plane                 (5 bits)
* Zero   Zero bits             (3 bits)
* Attr   Attributes           (16 bits)
* Res    Reserved              (8 bits)
* Uacc   Upper Accent          (8 bits)
* Lacc   Lower Accent          (8 bits)
* Res    Reserved              (8 bits)
* Char   Character             (8 bits)

All of these fields are actually implementation defined, with just one
rule for char: don't include characters that can be made with
combinations, that's what the accent fields are for.  This allows for
255 upper and lower accents which should be enough -- for now.
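For what it's worth, the layout above packs mechanically; this Python sketch (field positions taken from the table, helper names my own invention) shows one way to do it:

```python
def pack_emue_latin(char, uacc=0, lacc=0, attr=0, plane=1):
    """Pack one 64-bit EMUE code unit for the proposed Latin plane.

    Bit positions follow the table above: plane in 63..59, attr in
    55..40, uacc in 31..24, lacc in 23..16, char in 7..0; the zero
    and reserved fields stay 0.
    """
    assert 0 <= plane < 32 and 0 <= attr < 1 << 16
    assert all(0 <= f < 256 for f in (uacc, lacc, char))
    return (plane << 59) | (attr << 40) | (uacc << 24) | (lacc << 16) | char

def unpack_emue_latin(unit):
    # Invert the packing; reserved fields are ignored.
    return {
        "plane": unit >> 59,
        "attr": (unit >> 40) & 0xFFFF,
        "uacc": (unit >> 24) & 0xFF,
        "lacc": (unit >> 16) & 0xFF,
        "char": unit & 0xFF,
    }

# 'e' with one upper accent, no attributes:
print(hex(pack_emue_latin(ord("e"), uacc=1)))
```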

For Han characters I thought of the following encoding method (with no
particular plane in mind):

bits  63..59  58..56  55..40  39..32  31..0
     +-------+-------+-------+-------+------+
     | plane | zero  | attr  | style | char |
     +-------+-------+-------+-------+------+

* Plane  Plane                 (5 bits)
* Zero   Zero bits             (3 bits)
* Attr   Attributes           (16 bits)
* Style  Stylistic Variation   (8 bits)
* Char   Character            (32 bits)

Again, all fields are implementation defined.  Telling something like
a terminal emulator which stylistic variation to use is outside the
scope of this email, but for attributes there are standardized escape
sequences; I suspect language tags can also be used.

I was also thinking of a plane for punctuation and symbolic characters.

I will be pleased if anyone can come up with better encoding methods
than I did, and I call upon other people to come up with encodings for
scripts I know nothing about, such as Arabic and others.  Then let's
wrap it up in a technical report and be done with it ;)


Any comments?

Johann

-- 
Sometimes I do not think at all!  Does that mean I don't exist
in the mean time?




RE: Web Form: Other Question: Unicode characters in Form in MSAccess

2003-09-25 Thread Rick Cameron
Crystal Reports 9.0 does support Unicode, right from the database through to
printing or exporting.

When you say SQL2000, do you mean Microsoft SQL Server 2000?

What kind of database driver are you using to access the data - e.g. ODBC?
Ole DB? Via Access?

What is the data type of the column in the database?

Was the report created in CR 9.0, or in an earlier version of CR?

Cheers

- rick cameron, crystal decisions

-Original Message-
From: Magda Danish (Unicode) [mailto:[EMAIL PROTECTED] 
Sent: Thursday, 25 September 2003 15:29
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: FW: Web Form: Other Question: Unicode characters in Form in
MSAccess 


Hi,

I am forwarding your question to the Unicode list for possible answer from
one of the list subscribers.

Regards,

Magda Danish
Administrative Director
The Unicode Consortium
650-693-3921
 

> -Original Message-
> Date/Time:Tue Sep 23 04:06:15 EDT 2003
> Contact:  [EMAIL PROTECTED]
> Report Type:  Other Question, Problem, or Feedback
> 
> Greetings,
> I have a problem using Unicode outside the web.
> While in IE 5.0 or higher, when I choose UTF-8 encoding I can
> see Unicode data easily. These data are stored in a SQL2000
> database. But when I want to print them in a Form in MSAccess
> or Crystal Reports 9.0 (i.e. outside IE 5.0) the characters are
> unreadable.
> What shall I do to overcome this problem?
> Best Regards
> M.Janbeglou
> 
> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> (End of Report)
> 
> 
> 



Re: Unicode Normalisation Optimisation Experiments

2003-09-25 Thread Peter Kirk
On 25/09/2003 14:25, Markus Scherer wrote:

Peter Kirk wrote:

On 25/09/2003 12:27, [EMAIL PROTECTED] wrote:

It's not a reordering per se, as the first combining character is 
given the first "opportunity" to combine.
 
Thanks for the clarification.


In other words, yes, Unicode's NFC does perform "discontiguous 
composition". Some things might be easier if only contiguous 
composition were used, but the current definition does give you the 
shortest strings.
And this current definition cannot be changed because of the stability 
policy, right?

See also http://www.unicode.org/notes/tn5/#FCC (not a normative 
Unicode document).

markus


Thanks.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




FW: Web Form: Other Question: Unicode characters in Form in MSAccess

2003-09-25 Thread Magda Danish (Unicode)
Hi,

I am forwarding your question to the Unicode list for possible answer
from one of the list subscribers.

Regards,

Magda Danish
Administrative Director
The Unicode Consortium
650-693-3921
 

> -Original Message-
> Date/Time:Tue Sep 23 04:06:15 EDT 2003
> Contact:  [EMAIL PROTECTED]
> Report Type:  Other Question, Problem, or Feedback
> 
> Greetings,
> I have a problem using Unicode outside the web.
> While in IE 5.0 or higher, when I choose UTF-8 encoding I can
> see Unicode data easily. These data are stored in a SQL2000
> database. But when I want to print them in a Form in MSAccess
> or Crystal Reports 9.0 (i.e. outside IE 5.0) the characters are
> unreadable.
> What shall I do to overcome this problem?
> Best Regards
> M.Janbeglou
> 
> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> (End of Report)
> 
> 
> 



Re: About that alphabetician...

2003-09-25 Thread Deborah Goldsmith
I already wrote this up internally as a bug.

Thanks,
Deborah
On 2003/09/25, at 14:05, Tom Gewecke wrote:

About the c-cedilla, it appears that OS X Safari does not pick up
the charset on this page.  If the default is set to UTF-8, the c
disappears altogether.  The correct character is displayed only if
the browser is set by default or manually to Latin 1.






c-cedilla problem at NYT

2003-09-25 Thread Tom Gewecke
PS The reason the Latin 1 charset is not picked up by a browser would
appear to be bad HTML.  The page has



instead of






RE: About that alphabetician...

2003-09-25 Thread Tom Gewecke
About the c-cedilla, it appears that OS X Safari does not pick up the
charset on this page.  If the default is set to UTF-8, the c disappears
altogether.  The correct character is displayed only if the browser is
set by default or manually to Latin 1.




RE: AddDefaultCharset considered harmful (was: Mojibake on my Web pages)

2003-09-25 Thread Paul Deuter
Here is a link which describes how some hackers use 
%XX and %u url encoding to mask a malicious request
or to get around an IDS product.

http://www.cgisecurity.com/contrib/hd_spring_2002.pdf

-Paul

-Original Message-
From: Martin Duerst [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 25, 2003 1:32 PM
To: Doug Ewell; Unicode Mailing List
Subject: AddDefaultCharset considered harmful (was: Mojibake on my Web
pages)


Hello Doug, others,

Here is my most probable explanation:
Adelphia recently upgraded to Apache 2.0. The core config file (httpd.conf)
as distributed contains an entry
 AddDefaultCharset iso-8859-1
which does what you have described. They probably adopted this
because the comment in the config file suggests that it's important.

I have just filed a bug with bugzilla, asking that this default
setting be removed or commented out, and the comment fixed, at
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23421. You may
want to vote for that bug.

I have also commented on a related bug that I found, at
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14513.

I suggest you tell your Internet provider:
1) that they change to AddDefaultCharset Off
(or simply comment this out)
2) that they make sure you get FileInfo permission in your directories,
so that you can do the settings you know are correct.

The comment in the config file contains mostly very strange statements:

 
#
# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
#
AddDefaultCharset ISO-8859-1
 >>>

If anybody knows something about these security issues, please
tell me (any mention of security issues usually has webmasters
in control, for good reasons).


Regards,   Martin.



At 22:40 03/09/22 -0700, Doug Ewell wrote:
>Apologies in advance to anyone who visits my Web site and sees garbage
>characters, a.k.a. "mojibake."  It isn't my fault.
>
>Adelphia is currently having a character-set problem with their HTTP
>servers.  Apparently they are serving all pages as ISO 8859-1 even if
>they are marked as being encoded in another character set, such as
>UTF-8.

>If you manually change the encoding in your browser to UTF-8, or
>download the page and display it as a local file, everything looks fine
>because Adelphia's server is no longer calling the shot.  Their tech
>support people acknowledge that the problem is at their end and said
>they would look into it.
>
>I understand that having the "Unicode Encoded" logo on my page next to
>these garbage characters may not reflect well on Unicode, especially to
>newbies.  I'm considering putting a disclaimer at the top of my pages,
>but I'm waiting to see how quickly they solve the problem.
>
>-Doug Ewell
>  Fullerton, California
>  http://users.adelphia.net/~dewell/





Re: Unicode Normalisation Optimisation Experiments

2003-09-25 Thread Markus Scherer
Peter Kirk wrote:
On 25/09/2003 12:27, [EMAIL PROTECTED] wrote:
It's not a reordering per se, as the first combining character is 
given the first "opportunity" to combine.
 
Thanks for the clarification.
In other words, yes, Unicode's NFC does perform "discontiguous composition". Some things might be 
easier if only contiguous composition were used, but the current definition does give you the 
shortest strings.

See also http://www.unicode.org/notes/tn5/#FCC (not a normative Unicode document).

markus




AddDefaultCharset considered harmful (was: Mojibake on my Web pages)

2003-09-25 Thread Martin Duerst
Hello Doug, others,

Here is my most probable explanation:
Adelphia recently upgraded to Apache 2.0. The core config file (httpd.conf)
as distributed contains an entry
AddDefaultCharset iso-8859-1
which does what you have described. They probably adopted this
because the comment in the config file suggests that it's important.
I have just filed a bug with bugzilla, asking that this default
setting be removed or commented out, and the comment fixed, at
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=23421. You may
want to vote for that bug.
I have also commented on a related bug that I found, at
http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14513.
I suggest you tell your Internet provider:
1) that they change to AddDefaultCharset Off
   (or simply comment this out)
2) that they make sure you get FileInfo permission in your directories,
   so that you can do the settings you know are correct.
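In httpd.conf terms, those two suggestions look roughly like this (a sketch; the directory path is a placeholder, not the provider's real layout):

```apache
# 1) Stop stamping ISO-8859-1 on every response:
AddDefaultCharset Off

# 2) Let users override charset-related settings from their own .htaccess:
<Directory "/home/*/public_html">
    AllowOverride FileInfo
</Directory>
```

With FileInfo allowed, a user can then put e.g. `AddDefaultCharset UTF-8` in a per-directory .htaccess.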
The comment in the config file contains mostly very strange statements:


#
# Specify a default charset for all pages sent out. This is
# always a good idea and opens the door for future internationalisation
# of your web site, should you ever want it. Specifying it as
# a default does little harm; as the standard dictates that a page
# is in iso-8859-1 (latin1) unless specified otherwise i.e. you
# are merely stating the obvious. There are also some security
# reasons in browsers, related to javascript and URL parsing
# which encourage you to always set a default char set.
#
AddDefaultCharset ISO-8859-1
>>>
If anybody knows something about these security issues, please
tell me (any mention of security issues usually has webmasters
in control, for good reasons).
Regards,   Martin.



At 22:40 03/09/22 -0700, Doug Ewell wrote:
Apologies in advance to anyone who visits my Web site and sees garbage
characters, a.k.a. "mojibake."  It isn't my fault.
Adelphia is currently having a character-set problem with their HTTP
servers.  Apparently they are serving all pages as ISO 8859-1 even if
they are marked as being encoded in another character set, such as
UTF-8.

If you manually change the encoding in your browser to UTF-8, or
download the page and display it as a local file, everything looks fine
because Adelphia's server is no longer calling the shot.  Their tech
support people acknowledge that the problem is at their end and said
they would look into it.
I understand that having the "Unicode Encoded" logo on my page next to
these garbage characters may not reflect well on Unicode, especially to
newbies.  I'm considering putting a disclaimer at the top of my pages,
but I'm waiting to see how quickly they solve the problem.
-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




RE: a little more help understanding diacritical encoding

2003-09-25 Thread Paul Deuter
It would appear that your server side is in Java.
There is a well known issue in older versions of the
Java servlet spec that cause the request class to 
assume that %HH encoded octets are 8859-1 octets.
It seems that this is your problem.
The workaround is to get the parameters from the
request object and turn them back into bytes and
then re-interpret them as UTF-8 (because that is
what they are).

The code to do that looks like this:

String strFoo = new String(request.getParameter("whatever").getBytes("8859_1"), "UTF-8");
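The mechanism is easy to reproduce outside a servlet container (a Python sketch of the same bytes-level round trip; %C3%89 is the 'É' from the thread below):

```python
from urllib.parse import unquote

# The browser sends UTF-8 percent-encoded bytes; an old servlet
# container percent-decodes them as Latin-1, producing mojibake.
mangled = unquote("%C3%89", encoding="latin-1")   # what getParameter() yields
print(repr(mangled))

# The workaround, transliterated: re-encode as Latin-1 to recover the
# original bytes, then decode them as the UTF-8 they really are.
fixed = mangled.encode("latin-1").decode("utf-8")
print(fixed)   # É
```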

-Paul



-Original Message-
From: Steve Pruitt [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 25, 2003 9:03 AM
To: [EMAIL PROTECTED]
Subject: a little more help understanding diacritical encoding


Thanks for the excellent responses.  I now understand how C3 and 89 are derived.  I 
tried getting everything set the way I interpreted what the list responses said to do. 
 The scenario is:
I have a page with some diacritical characters displayed and a input text box and a 
submit button.  I copy and paste one of the displayed characters into the input box and 
then submit.  What is submitted gets echoed back.  The pages use style sheets so I cut 
and pasted the relevant tags, etc.

I thought I found the problem.  My response had a character encoding of null.  I read 
null defaults to 8859-1 which seemed consistent with my echoed page.  So, I explicitly 
set the response character encoding to UTF-8 via the setContentType method.

I used a TCP tunneler to see what my request and responses look like.  My browser is 
set to utf-8 also.

From the tunneler my request had the following posted data:  v904=%C3%89   this is
correct according to how the UTF-8 encoding algorithm was explained.

The http response had the following:

Content-Type: text/html; charset=UTF-8   this is correct.

  is a child in the 
 tag

É ê ë í î ï ð ñ ó 
ô õ ö  these are the listed characters on the previous page I 
cut and pasted from; they are listed on this page just for reference - (#201 = C9) is É.

Accented Characters from  previous 
form:  Ã‰ 
this is echoed back.  #195 = C3 and #137 = 89.  These, of course, are displayed as Ã?.

I checked the browser to be sure and its encoding is still set to utf-8 and it is.  
This is everything I know to check.  What am I missing?




Re: About that alphabetician...

2003-09-25 Thread Michael Everson
At 13:19 -0700 2003-09-25, James Caldwell wrote:

Congratulations!  You have given Unicode a tremendous boost with 
this interview, published in the New York Times!

I am sure it will bring many positive results for our work and for 
your career.
Thank you very much. Please give generously to the Script Encoding 
Initiative http://www.unicode.org/sei if you can.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: About that alphabetician...

2003-09-25 Thread Brian Doyle
Thanks for the tip. There must be something wrong with my machine. If
anyone has any suggestions for how to troubleshoot this, please email me
privately.

On 9/25/03 1:54 PM, "John Burger" <[EMAIL PROTECTED]> wrote:

> Brian Doyle wrote:
> 
>> The observation that I, the "Irish (American) colleague," made to
>> Michael
>> was that there is a sentence in the NYT article displayed in my
>> browser that
>> dropped the 00E7 LATIN SMALL LETTER C WITH CEDILLA (e.g., François).
> 
> The c-cedilla is really there, I see it in three browsers on my Mac
> (Camino, Safari, and IE).
> 
> - John Burger
>  MITRE
> 




Re: Michael Everson in the news

2003-09-25 Thread John Cowan
Eric Muller scripsit:
> See also , 
> which is apparently about SEI.

Interestingly, both this and the NYT article are written by the same
person: Michael Erard.

-- 
John Cowan  [EMAIL PROTECTED]
http://www.ccil.org/~cowan  http://www.reutershealth.com
Thor Heyerdahl recounts his attempt to prove Rudyard Kipling's theory
that the mongoose first came to India on a raft from Polynesia.
--blurb for _Rikki-Kon-Tiki-Tavi_



RE: About that alphabetician...

2003-09-25 Thread Asmus Freytag
At 05:41 PM 9/25/03 +0100, Richard Ishida wrote:
Aha.  Maybe, next time I try to explain it on the plane, I'll say
something like:
"Unicode is a standard for enabling your computer to represent all the
letters of all the alphabets of the world."
Still not terribly accurate and deliberately vague (and could refer in
their mind to characters and/or fonts), but then the average layman
probably wouldn't know or need to know it was inaccurate or vague.
I usually like to say that

"Unicode is simply a list, you know, like a catalog, where you can find 
all the letters of all the alphabets of the world."

That allows me to segue to the tasks that people perform.

"If all the computers in the world use the same list, you can type in any 
language anywhere and people on the opposite end of the earth can read it."

Why is this good?

"If everybody uses their own list, as used to be the case, very often 
theres a mismatch and instead of text you get garbage, or random letters on 
your screen."

For the longer answer (still for newbies) see the first part of my Unicode 
tutorial.

A./



Re: About that alphabetician...

2003-09-25 Thread John Burger
Brian Doyle wrote:

The observation that I, the “Irish (American) colleague,” made to 
Michael
was that there is a sentence in the NYT article displayed in my 
browser that
dropped the 00E7 LATIN SMALL LETTER C WITH CEDILLA (e.g., François).
The c-cedilla is really there, I see it in three browsers on my Mac 
(Camino, Safari, and IE).

- John Burger
  MITRE



Re: About that alphabetician...

2003-09-25 Thread Michael Everson
At 12:49 -0500 2003-09-25, Brian Doyle wrote:

The observation that I, the "Irish (American) colleague," made to Michael
was that there is a sentence in the NYT article displayed in my browser that
dropped the 00E7 LATIN SMALL LETTER C WITH CEDILLA (e.g., François).
There's nothing in the paragraph in question to indicate that there is a
missing character--nor is there a numeric code displayed for a savvy user to
look up.
I see the ç when I view the page and I'm using Safari as you are.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


RE: About that alphabetician...

2003-09-25 Thread Marco Cimarosti
Michael Everson wrote:
> At 08:33 -0700 2003-09-25, John Hudson wrote:
> 
> >Unicode is an encoding standard for text on computers that allows 
> >documents in any script and language to be entered, stored, edited 
> >and exchanged.
> 
> >>blank stare from layman<<

Unicode is a code in which every letter of every alphabet in the world
corresponds to a number. This numeric code is used to write text inside
computers, because only can be written numbers inside computers. When the
computer shows on the screen the text which it has inside, it draws the
letters corresponding to the Unicode numbers which it has inside.

My 4-year-old listened to this explanation and said everything was clear.

The only problem is that he now wants to disassemble my computer to see the
numbers it has inside. He thinks that the numbers are stored in the form of
talking ladybugs which would say the number out loud when you tap on them (he
gained this idea from one of his favorite books: "Learn the Numbers with the
Talking Ladybugs").

_ Marco



Re: About that alphabetician...

2003-09-25 Thread Brian Doyle
Eric,

Forgive my density. I'm not sure that I understand. Are you arguing that an
ASCII encoding scheme (ISO-8859-1) is not a limitation because,
semantically, all of the characters (a, b, c, etc.) also exist in the
Unicode scheme?

It makes sense to me that ASCII is not a limitation for those documents that
are limited to that character set. But, your own message, ³which contains
U+10DB ? GEORGIAN LETTER MAN and U+092E Ã DEVANAGARI LETTER MA² triggers an
error message in my own email client (Entourage X), namely:

"Some text in this message is in a language that your computer cannot
display."

I'm not certain if I'm seeing this because I don't possess a font to display
those characters or some other reason. I suspect that this is the reason
because, when I try to look up those characters in OS X's Character
Palette, the Georgian and Devanagari Unicode blocks show up blank.

The observation that I, the "Irish (American) colleague," made to Michael
was that there is a sentence in the NYT article displayed in my browser that
dropped the 00E7 LATIN SMALL LETTER C WITH CEDILLA (e.g., François).

There's nothing in the paragraph in question to indicate that there is a
missing character--nor is there a numeric code displayed for a savvy user to
look up.

Surely in this context, we would agree that the semantic content was
distorted, yes?

Sincerely,
Brian Doyle
Unicode newbie


On 9/25/03 11:54 AM, "Eric Muller" <[EMAIL PROTECTED]> wrote:

> 
> 
> Michael Everson wrote:
>> An Irish colleague here said he liked the article but noted that the Times'
>> web directors don't use Unicode
>> 
>> 
>>> ...  
>>> 
>>> ...  
>>> 
>>> 
> There is an alternative point of view, which says that charset declared in an
> HTML (or XML) document is no more than an encoding scheme, and that all
> characters in those documents are fundamentally Unicode characters (i.e. they
> start in life with the full semantic of Unicode, they don't inherit it on the
> occasion of character set conversion). That view is supported by the XML spec
> itself, and by the infoset definition. And because we have numeric character
> entities, using an iso-8859-1 encoding scheme is not really a limitation:
> witness this message, which contains U+10DB ? GEORGIAN LETTER MAN and U+092E Ã
> DEVANAGARI LETTER MA.
> 
> Eric.
> 
> 





Re: Questions on Myanmar encoding

2003-09-25 Thread Maung TunTunLwin
Hello Mr. Eric Muller,

> It is in Unicode 4.0, section 10.3, page 273, and you can see it at:
> 

Thanks.

> > 1021 1013 1031 101B 102D 1000 1014 1039 200C 1012 1031 102C 1039 200C 101C
> > 102C 0020 1042 1048 0020 1042 002C 1040 1040 1040 0020 1000 1030 100A 102E
> > => "US$28 2,00 ...?" I think help? 1000 1030 100A 102E 1015 102C
> >
> > Just one character wrong: 1031 on third place should be 1012.
> >
> my original: 1021 1013 1031...
> your correction: 1021 1013 1012 ...
>
> I am a bit confused, and looking more carefully, my new guess is: 1021
> 1019 1031... Apparently, that makes the first word sound like "american".

Sorry, my mistake. It should be second place: 1013 -> 1012. You may be right
with your sample, but currently $ is used with 1012.


> > 1010 102D 101B 1005 1039 1006 102C 1014 1039 200C 1025 101A 1039 101A 102C
> > 1025 1039 200C 1018 102F 102D 1037 0020 2018 1015 1004 1039 200C 1012 102C
> > 1014 102E 2019 0020 101B 1031 102C 1000 1039 200C => " 'PandaNi' for zoo..."

> I think I understand. Also, I corrected 1018, which should be 101E.

1018 102F 102D 1037 (for), 101E 102F 102D 1037 (to). Both are usable.

> Just to be clear, I am not proposing any modification to the encoding
> model. At best, I can think of clarifications that could help people
> like me, who have limited knowledge of the script.

I am also not trying to change the standard. I am currently trying to figure
out the current encoding's limitations and looking for ways to extend it.

> In another place in your message, you mention that the current model is
> not optimal for sorting. I am not a specialist of sorting, but this is
> not an entirely unusual situation. It is in general not possible to make
> the encoding model such that it is optimal for all processings
> (rendering, sorting, etc.) You may want to check carefully the UCA, to
> see if and how it can handle proper sorting.

Yes, I know, and thanks for your advice.
I'm finally accepting that the encoding model is not optimal for rendering and
sorting. But there are still two things I am afraid of:
One: the encoding model must have the ability to do quick word cutting for
sorting, wrapping, and searching.
- Currently I see a possibility with wrapping in Graphite.
Two: the encoding model must be usable with current rendering systems or it
will be a paper tiger (three years!).
- I see it can work with Graphite with an intelligent input method. But what
about other systems? An OpenType font doesn't handle the line wrapping that
Uniscribe did. But what about Vowel Sign E (1031) handling, moving it front
and back?

Sorry, I put too much feeling into this.

Maung TunTunLwin
[EMAIL PROTECTED]




RE: About that alphabetician...

2003-09-25 Thread Michael Everson
At 08:33 -0700 2003-09-25, John Hudson wrote:

Unicode is an encoding standard for text on computers that allows 
documents in any script and language to be entered, stored, edited 
and exchanged.

>>blank stare from layman<<

I think it is best to relate the description to what the layman 
does: he types things, and he edits them and he sends them to other 
laymen. The 'big font' thing is a really bad idea because it is 
completely inaccurate: that's not informing the layman in terms he 
understands, that's misleading him.
Only if you don't follow it up with a second sentence.

I also think it is a good idea to include the word 'encoding', 
because if the rest of one's description is simple it can be a 
useful way to plant new terminology in someone's head.
Honestly it depends what kind of layman you are talking to. Many's 
the time I was beavering away on some proposal or other down the pub, 
and have been accosted with a "what are you doing?"
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: a little more help understanding diacritical encoding

2003-09-25 Thread jon
This is likely an issue with whatever you are using to read and echo back the 
characters. If you just push the exact same bytes back then you will be okay, but 
anything more clever gives you an opportunity to go wrong - especially if you are 
using an API that thinks it knows better than you do.

What are you using to write and run this code?







Re: About that alphabetician...

2003-09-25 Thread Rick McGowan
Michael wrote:

> I was asked how I describe it briefly to laymen. And I usually say
> "Unicode is like a big, giant font that is supposed to contain all
> the letters of all the alphabets of all the languages in the world."

Now, why do you suppose he removed *that* "like" and, like, left in all  
the others!?

Rick




Re: About that alphabetician...

2003-09-25 Thread Eric Muller






Michael Everson wrote:
An Irish colleague
here said he liked the article but noted that the Times' web directors
don't use Unicode

There is an alternative point of view, which says that charset declared
in an HTML (or XML) document is no more than an encoding scheme, and
that all characters in those documents are fundamentally Unicode
characters (i.e. they start in life with the full semantic of Unicode,
they don't inherit it on the occasion of character set conversion).
That view is supported by the XML spec itself, and by the infoset
definition. And because we have numeric character entities, using an
iso-8859-1 encoding scheme is not really a limitation: witness this
message, which contains U+10DB მ GEORGIAN LETTER MAN and U+092E म
DEVANAGARI LETTER MA.
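A quick sketch of that trick in Python (my own illustration, not from the XML spec): a codec can emit numeric character references automatically for anything that doesn't fit the target encoding, so an iso-8859-1 page loses nothing:

```python
# Sketch: serialising text for an iso-8859-1 page. Characters outside
# Latin-1 are written as numeric character references (&#NNNN;), so
# the full Unicode repertoire survives the narrow encoding scheme.
def to_latin1_html(text: str) -> bytes:
    # 'xmlcharrefreplace' substitutes &#NNNN; for unencodable characters
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(to_latin1_html("U+10DB is \u10db and U+092E is \u092e"))
```

The references decode back to the same Unicode characters on the receiving side, which is the point of the "fundamentally Unicode characters" view.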

Eric.







a little more help understanding diacritical encoding

2003-09-25 Thread Steve Pruitt
Thanks for the excellent responses.  I now understand how C3 and 89 are derived.  I 
tried getting everything set the way I interpreted what the list responses said to do. 
 The scenario is:
I have a page with some diacritical characters displayed, an input text box and a 
submit button.  I copy and paste one of the displayed characters into the input box and 
then submit.  What is submitted gets echoed back.  The pages use style sheets so I cut 
and pasted the relevant tags, etc.

I thought I found the problem.  My response had a character encoding of null.  I read 
null defaults to 8859-1 which seemed consistent with my echoed page.  So, I explicitly 
set the response character encoding to UTF-8 via the setContentType method.

I used a TCP tunneler to see what my request and responses look like.  My browser is 
set to utf-8 also.

From the tunneler my request had the following posted data:  v904=%C3%89 - this is 
correct according to how the utf encoding algo was explained.

The http response had the following:

Content-Type: text/html; charset=UTF-8   this is correct.

  is a child in the 
 tag

É ê ë í î ï ð ñ ó 
ô õ ö - these are the characters listed on the previous page I 
cut and paste from; they are listed on this page just for reference - (&#201; = 0xC9) is É.

Accented Characters from previous 
form:  Ã‰ - 
this is echoed back.  &#195; = 0xC3 and &#137; = 0x89.  These, of course, are displayed as Ã?.

I checked the browser to be sure, and its encoding is still set to utf-8.  
This is everything I know to check.  What am I missing?



RE: About that alphabetician...

2003-09-25 Thread John Hudson
At 07:11 AM 9/25/2003, Hart, Edwin F. wrote:

I like to say, "Unicode and ISO/IEC 10646 describe a single standard for
representing the world's characters in computers as a series of numbers
(zeros and ones)."
Unicode is an encoding standard for text on computers that allows documents 
in any script and language to be entered, stored, edited and exchanged.

I think it is best to relate the description to what the layman does: he 
types things, and he edits them and he sends them to other laymen. The 'big 
font' thing is a really bad idea because it is completely inaccurate: 
that's not informing the layman in terms he understands, that's misleading 
him. I also think it is a good idea to include the word 'encoding', because 
if the rest of one's description is simple it can be a useful way to plant 
new terminology in someone's head.

I have not seen the article yet -- too little time with ATypI kicking off 
this evening --, but I'm sure Michael did a grand job otherwise.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC   [EMAIL PROTECTED]
You need a good operator to make type. If it were a
DIY affair the caster would only run for about five
minutes before the DIYer burned his butt off.
  - Jim Rimmer



RE: About that alphabetician...

2003-09-25 Thread Michael Everson
An Irish colleague here said he liked the article 
but noted that the Times' web directors don't use 
Unicode

Is maith liom an t-alt ach tá díomá orm feiceáil nach bhfuil Unicode in
úsáid ag stiúrthóirí gréasáin de chuid NYT. [I like the article, but I am
disappointed to see that Unicode is not in use by the NYT's web directors.]
Seo cód an leathanaigh: [Here is the page's code:]




...

...
For the World's A B C's, He Makes 1's and 0's

--
Michael Everson * * Everson Typography *  * http://www.evertype.com


RE: About that alphabetician...

2003-09-25 Thread Michael Everson
At 10:11 -0400 2003-09-25, Hart, Edwin F. wrote:

It is always a challenge to describe technology in terms that the lay person
can understand.
I like to say, "Unicode and ISO/IEC 10646 describe a single standard for
representing the world's characters in computers as a series of numbers
(zeros and ones)."
Indeed. But the layman knows what a font is, and an alphabet. 
"Characters" has to come in the sentence after. ;-)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: About that alphabetician...

2003-09-25 Thread Michael Everson
And on this very day, my copy of Unicode 4.0 has arrived. :-)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: W3C Objects To Royalties On ISO Country Codes

2003-09-25 Thread Eric Muller
See also .

Eric.





Re: Michael Everson in the news

2003-09-25 Thread Eric Muller
See also , 
which is apparently about SEI.

Eric.





Re: Unicode Normalisaton Optimisation Experiments

2003-09-25 Thread Peter Kirk
On 25/09/2003 12:27, [EMAIL PROTECTED] wrote:

Is this actually correct? For example, if I have in my data the string 
<U+0041, U+0328, U+05B0> (which I know is garbage, but that is irrelevant), that
will decompose and reorder to <U+0041, U+05B0, U+0328>, as U+0328 has a
higher combining class (202) than U+05B0 (10). What does this become in 
NFC? Is the reordering reversed and the combination reapplied?
   

First an attempt is made to compose U+0041 and U+05B0. There is no character allowing for this, so that attempt will fail. Then an attempt is made to compose U+0041 and U+0328, which will produce U+0104. U+0041 is replaced with U+0104 and U+0328 is removed, resulting in <U+0104, U+05B0>.

It's not a reordering per se, as the first combining character is given the first "opportunity" to combine.
 

Thanks for the clarification.

 

This is not only a theoretical issue as the same applies to some real 
combinations. There was discussion only last week on the bidi list of a 
form which might be encoded <U+064A, U+0654, U+0652> but which would be
messed up if composed into <U+0626, U+0652>.
   

Yes, NFC would perform that composition. Are you sure it would be an issue? 
Applying bidi rules doesn't seem to make this an issue.

bidi: Al, NSM, NSM
applying rule W1 from UAX #9:
Al, NSM, NSM -> Al, Al, NSM -> Al, Al, Al.

bidi: Al, NSM
applying rule W1:
Al, NSM -> Al, Al
Or is the issue with something else, but it came up on the bidi list?

 

The problem isn't with the bidi rules but with more general Arabic 
shaping etc. There are two issues, one the position of the hamza (in 
this case it should be to the left of the sukun) and the other that the 
medial form of U+064A has dots below, which are required in this 
combination, but the medial form of U+0626 does not. But I think we 
concluded that U+0654 alone is not suitable for encoding this particular 
hamza.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Michael Everson in the news

2003-09-25 Thread Martin Heijdra
In today's New York Times (Circuits section) there is an article (with
pictures) on Michael Everson and Unicode. Rick McGowan and Deborah Anderson
make guest appearances.

On the net:
http://www.nytimes.com/2003/09/25/technology/circuits/25code.html

Martin J. Heijdra
Chinese Studies/East Asian Studies Bibliographer
East Asian Library and the Gest Collection
Frist Campus Center, Room 314
Princeton University
Princeton, NJ 08544
United States




Re: About that alphabetician...

2003-09-25 Thread Michael Everson
One complaint:

Very interesting. I didn't realize Unicode was a "large font", 
though... I thought it was a character encoding system, distinct 
from fonts, due to the character/glyph model :)
Another complaint:

Some purist will try to kill you for calling Unicode "a big, giant font ..."
I was asked how I describe it briefly to laymen. And I usually say 
"Unicode is like a big, giant font that is supposed to contain all 
the letters of all the alphabets of all the languages in the world."
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Fun with proof by analogy, was Re: Mojibake on my Web pages

2003-09-25 Thread jon
> >>Suppose you made a document and sent it to me via conventional post.
> >>
> >>The last agent handling the document would be the mail carrier.
> >>Does the mail carrier have the right to open the mailing and
> >>replace your document with garbage?
> >>
> >>
> >>
> >>No, however if I receive a letter in the post written in German I'm going
> to ask someone to translate it rather than try to cope with a language (c.f.
> encoding) I don't understand.
> >>  
> >>
> Yes, if that's what you ask for. But as I know some German I may prefer 
> to do my own translation. And if the recipient is a German who knows no 
> English, they certainly aren't going to be amused if their letters get 
> translated whether they want them to be or not. So the mail carrier 
> should do this only if specifically asked to do so.
> 

Indeed. Remember the problem here isn't a server performing translation, 
transliteration or re-encoding - but rather a server misidentifying an encoding (hence 
my analogy of the translator having a nervous break-down, that and the fact that the 
image struck me as funny).

However to enable a correctly functioning server to perform such re-encoding *when 
asked to do so* we have to have the rule that HTTP-headers over-ride embedded 
self-description for text-based formats. This causes problems in cases like those 
described, but not when the webserver has a rough idea of what the hell it is doing.

One could argue against the rule of headers having precedence on the basis that it is 
brittle, but it is no more brittle than trusting copy-and-pasted <meta> elements, which 
are also likely to be wrong (trust me, I've seen enough that my anecdotal experience is 
approaching statistical validity).

But one day it will all be Unicode... one day...







Re: Fun with proof by analogy, was Re: Mojibake on my Web pages

2003-09-25 Thread Peter Kirk
On 25/09/2003 10:51, [EMAIL PROTECTED] wrote:

Suppose you made a document and sent it to me via conventional post.

The last agent handling the document would be the mail carrier.
Does the mail carrier have the right to open the mailing and
replace your document with garbage?
   

No, however if I receive a letter in the post written in German I'm going to ask someone to translate it rather than try to cope with a language (c.f. encoding) I don't understand.
 

Yes, if that's what you ask for. But as I know some German I may prefer 
to do my own translation. And if the recipient is a German who knows no 
English, they certainly aren't going to be amused if their letters get 
translated whether they want them to be or not. So the mail carrier 
should do this only if specifically asked to do so.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Unicode Normalisaton Optimisation Experiments

2003-09-25 Thread jon
> Is this actually correct? For example, if I have in my data the string 
> <U+0041, U+0328, U+05B0> (which I know is garbage, but that is irrelevant), that
> will decompose and reorder to <U+0041, U+05B0, U+0328>, as U+0328 has a
> higher combining class (202) than U+05B0 (10). What does this become in 
> NFC? Is the reordering reversed and the combination reapplied?

First an attempt is made to compose U+0041 and U+05B0. There is no character allowing 
for this, so that attempt will fail. Then an attempt is made to compose U+0041 and 
U+0328 which will produce U+0104. U+0041 is replaced with U+0104 and U+0328 is removed 
resulting in <U+0104, U+05B0>.

It's not a reordering per se, as the first combining character is given the first 
"opportunity" to combine.
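That behaviour can be checked directly with Python's unicodedata module (a sketch using the same hypothetical garbage string from the question):

```python
import unicodedata

# A, combining ogonek (combining class 202), Hebrew sheva (class 10)
s = "\u0041\u0328\u05B0"

# NFD reorders the marks by combining class: sheva (10) before ogonek (202)
assert unicodedata.normalize("NFD", s) == "\u0041\u05B0\u0328"

# NFC gives each mark a chance to combine with the starter in turn:
# A + sheva has no composite; A + ogonek composes to U+0104
assert unicodedata.normalize("NFC", s) == "\u0104\u05B0"
```

So composition is not a literal reversal of the reordering; each unblocked mark simply gets its turn at the starter.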

> This is not only a theoretical issue as the same applies to some real 
> combinations. There was discussion only last week on the bidi list of a 
> form which might be encoded <U+064A, U+0654, U+0652> but which would be
> messed up if composed into <U+0626, U+0652>.

Yes, NFC would perform that composition. Are you sure it would be an issue? Applying 
bidi rules doesn't seem to make this an issue.

bidi: Al, NSM, NSM
applying rule W1 from UAX #9:
Al, NSM, NSM -> Al, Al, NSM -> Al, Al, Al.


bidi: Al, NSM
applying rule W1:
Al, NSM -> Al, Al

Or is the issue with something else, but it came up on the bidi list?







Re: need help understanding diacritical encoding

2003-09-25 Thread jon
> I have a form that posts diacritical characters. For example, when my browser
> has the encoding set to utf-8 and the form posts the character É
> the post data has these two bytes, C3 and 89, which when echoed back on a new
> page is displayed as Ã?.  Can someone explain, when the character is converted
> to two bytes, how I get C3 and 89?
> 
> 

UTF-8 is explained in section 3.9 of the Unicode standard and elsewhere (RFC 2279 is a 
heavily-referenced document, note that its description includes the encoding of 
codepoints outside of the Unicode range).

É is U+00C9 and in binary that is:

11001001

UTF-8 encoding results in different numbers of bytes depending on how many bits you 
have when you remove the leading zeros (8 bits in this case - resulting in two bytes).

It then puts those bits from the codepoint into bytes as so:


00000000 0xxxxxxx -> 0xxxxxxx
00000yyy yyxxxxxx -> 110yyyyy 10xxxxxx
zzzzyyyy yyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx -> 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

In the case of U+00C9 the second of these is the shortest form possible, so it is 
used. The bits 00011 are placed in 110yyyyy to give you 11000011 (0xC3) and the bits 
001001 are placed in 10xxxxxx to give you 10001001 (0x89).
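The two-byte case can be written out as a few lines of Python (my own illustration, not sample code from the standard):

```python
def utf8_two_byte(cp: int) -> bytes:
    """Encode a code point in the range U+0080..U+07FF as two UTF-8 bytes."""
    assert 0x80 <= cp <= 0x7FF
    lead = 0xC0 | (cp >> 6)     # 110yyyyy: the top five payload bits
    trail = 0x80 | (cp & 0x3F)  # 10xxxxxx: the bottom six payload bits
    return bytes([lead, trail])

assert utf8_two_byte(0x00C9) == b"\xC3\x89"  # É -> C3 89, as observed
```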

The problem is that this didn't happen when the bytes went back out again - rather the 
bytes were interpreted as being part of a string encoded in some other way (most 
likely ISO 8859-1, which certainly would produce Ã followed by a control character 
from those bytes). It may be that all you need to do is to correctly report the 
encoding, by sending an HTTP header with the mime-type and charset (some server-side APIs 
make this easy, e.g. in ASP you would use Response.Charset = "utf-8"). It may be that 
you need to do further work (depending on just what it is you are doing with the form).







Re: Unicode Normalisaton Optimisation Experiments

2003-09-25 Thread Peter Kirk
On 24/09/2003 14:58, Jon Hanna wrote:

... For example since following the decomposition <U+0104> -> <U+0041, U+0328> there can be no character that is unblocked from the U+0041 that will combine with it, hence there is no circumstance in which they will not be recombined to U+0104 and hence dropping that decomposition from the data will not affect NFC (the relevant data would still have to be in the composition table, as the sequence <U+0041, U+0328> might occur in the source code).

 

Is this actually correct? For example, if I have in my data the string 
<U+0041, U+0328, U+05B0> (which I know is garbage, but that is irrelevant), that 
will decompose and reorder to <U+0041, U+05B0, U+0328>, as U+0328 has a 
higher combining class (202) than U+05B0 (10). What does this become in 
NFC? Is the reordering reversed and the combination reapplied?

This is not only a theoretical issue as the same applies to some real 
combinations. There was discussion only last week on the bidi list of a 
form which might be encoded <U+064A, U+0654, U+0652> but which would be 
messed up if composed into <U+0626, U+0652>.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Mojibake on my Web pages

2003-09-25 Thread jon
> > Maybe including a BOM would help the browser realise something was
> > awry, but it's just as likely to think the author just wrote an
> > invalid document that began with 

I really have to stop using this web-2-mail app, it managed to mangle my 
representation of a mangled BOM!

> I've been told, hee hee hee, that the one thing I must NEVER NEVER do in
> a Web page is to begin it with a BOM.  But I admit I haven't tried that
> yet.  How funny would that be if it solved the problem?

Well with UTF-16 you really should use a BOM but with UTF-8 there was a bit of a 
debate which finally settled on the opinion that it was OK to do so. However browsers 
are not necessarily going to agree with that opinion. Anyway, it couldn't hurt to try.







Fun with proof by analogy, was Re: Mojibake on my Web pages

2003-09-25 Thread jon
> Suppose you made a document and sent it to me via conventional post.
> 
> The last agent handling the document would be the mail carrier.
> Does the mail carrier have the right to open the mailing and
> replace your document with garbage?

No, however if I receive a letter in the post written in German I'm going to ask 
someone to translate it rather than try to cope with a language (c.f. encoding) I 
don't understand.

Besides, what is happening here isn't the server replacing the document with garbage, 
it's the server mis-identifying what the document is - analogous either to our 
hypothetical translator having a break-down, insisting that all of our mail was German 
and handing us non sequiturs as "translations", or to the postal service getting the 
delivery wrong (which is something that has certainly happened to my mail).

> 
> An analogy:
> 
> Author = Host
> Document = Wine
> Reader = Guest
> Server = Cup
> 
> If the host pours a cup of wine for the guest, would we allow a
> mere cup to adulterate our wine?

The argument only holds as much as the analogies hold (both the analogy with snail 
mail and the one you actually refer to as an analogy). These analogies do hold in 
certain cases, and the case that started the thread is an example, but it does not 
hold in the general case. In other scenarios better analogies would be:

Author = Scribe
Document = Draft
Reader = em, Reader
Server = Editor.

Or Author = scattered data sources of varying degrees of reliability - Server = 
researcher.

In general, from the browser's perspective the server is the author (which may or may 
not be an accurate view of what goes on "behind the scenes"). Re-encoding, if done 
right, can be very useful in making web documents more widely accessible.

Of course we'll soon be able to just rely on assuming that every step in the process 
can understand UTF-8 and UTF-16 :)







Re: Unicode Normalisaton Optimisation Experiments

2003-09-25 Thread jon
> > Hi,
> > I'm currently experimenting with various trade-offs for Unicode
> normalisation code. Any comments on these (particularly of the "that's
> insane, here's why, stop now!" variety) would be welcome.
> 
> You might want to look at, if not even use, the ICU open-source
> implementation:
> 
> http://oss.software.ibm.com/icu/
> http://oss.software.ibm.com/cvs/icu/~checkout~/icu/source/common/unorm.cpp

I did, but when I started this I was more interested in simply comparing various 
optimisations as a study into the related techniques. However I recently hit a 
practical need for such code for another task, and while it's nice that I've a bunch 
of "work" code already done as "fun" code maybe I should just use ICU...

> > The second is an optimisation of both speed and size, with the
> disadvantage that data cannot be shared between NFC and NFD operations (which
> is perhaps a reasonable trade in the case of web code which might only need NFC
> code to be linked). In this version decompositions of stable codepoints are
> omitted from the decomposition data. For example since following the
> decomposition <U+0104> -> <U+0041, U+0328> there can be no
> character that is unblocked from the U+0041 that will combine with it, hence
> there is no circumstance in which they will not be recombined to U+0104 and
> hence dropping that decomposition from the data will not affect NFC (the
> relevant data would still have to be in the composition table, as the sequence
> <U+0041, U+0328> might occur in the source code).
> 
> Sounds possible and clever. As far as I remember, ICU uses the normalization
> quick check flags 
> (Unicode properties) to determine much of this, and should achieve the same in
> most cases.

The above would supplement use of quick check - indeed it would be a way of 
implementing the concept of "stable codepoints" that the UTR suggests using with quick 
check.
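For what it's worth, the quick-check idea is exposed directly in some libraries; a small sketch using Python's unicodedata.is_normalized (available since Python 3.8, and itself built on the quick-check properties):

```python
import unicodedata

# U+0104 (A with ogonek) is already in NFC; the decomposed
# pair <U+0041, U+0328> is not, since NFC would recompose it
assert unicodedata.is_normalized("NFC", "\u0104")
assert not unicodedata.is_normalized("NFC", "\u0041\u0328")

# ...and the reverse holds for NFD
assert unicodedata.is_normalized("NFD", "\u0041\u0328")
assert not unicodedata.is_normalized("NFD", "\u0104")
```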







Re: need help understanding diacritical encoding

2003-09-25 Thread Stephane Bortzmeyer
On Wed, Sep 24, 2003 at 02:37:20PM -0400,
 Steve Pruitt <[EMAIL PROTECTED]> wrote 
 a message of 8 lines which said:

> I have a form that posts diacritical characters. For example, when
> my browser has the encoding set to utf-8
 ^
 OK

> and the form posts the character É the post data has these two bytes
> C3 and 89,

It seems reasonable, "0xC3 0x89" is UTF-8 for É.

> which when echoed back on a new page is displayed as Ã?.  

Your Web browser cannot properly display UTF-8 (it is probably
configured to display as Latin-1). The exact solution depends on it.

*or*

Your Web server sent back the reply as UTF-8 but tagged it as
Latin-1. Check the HTTP headers to be sure.
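Both failure modes are easy to reproduce; a Python sketch of the second one (UTF-8 bytes served but labelled, or displayed, as Latin-1):

```python
data = "\u00C9".encode("utf-8")   # É as UTF-8: the bytes C3 89
assert data == b"\xC3\x89"

# A browser that believes the page is Latin-1 maps each byte to one
# character: 0xC3 -> Ã, 0x89 -> an (often invisible) C1 control char
garbled = data.decode("latin-1")
assert garbled == "\u00C3\u0089"
```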



Re: Mojibake on my Web pages

2003-09-25 Thread Doug Ewell
 wrote:

> Maybe including a BOM would help the browser realise something was
> awry, but it's just as likely to think the author just wrote an
> invalid document that began with 

I've been told, hee hee hee, that the one thing I must NEVER NEVER do in
a Web page is to begin it with a BOM.  But I admit I haven't tried that
yet.  How funny would that be if it solved the problem?

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/