RE: CLDR locales: Filipino (fil/ph?) Pilipino/Tagalog (tl/tlg)

2004-12-27 Thread Addison Phillips [wM]
Following draft-langtags (and CLDR usage), it would be "tl-Tglg-PH"
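
As an illustration only (a sketch using the modern java.util.Locale API, which
post-dates this thread, not anything from the original mail), the
language-script-region order falls out directly when the tag is composed from
its subtags:

    import java.util.Locale;

    public class TagDemo {
        public static void main(String[] args) {
            // Compose language + script + region in draft-langtags order.
            Locale tagalog = new Locale.Builder()
                    .setLanguage("tl")   // ISO 639-1 language subtag
                    .setScript("Tglg")   // ISO 15924 script subtag
                    .setRegion("PH")     // ISO 3166 region subtag
                    .build();
            System.out.println(tagalog.toLanguageTag()); // prints "tl-Tglg-PH"
        }
    }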

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] Behalf Of Philippe Verdy
> Sent: 2004-12-27 11:33
> To: unicode@unicode.org
> Subject: Re: CLDR locales: Filipino (fil/ph?) Pilipino/Tagalog (tl/tlg)
> 
> 
> From: "Philippe Verdy" <[EMAIL PROTECTED]>
> > Now comes the problem of tagging localized resources for the
> > Philippines: can we use "ph" today? or must we use only "fil" or
> > "fil-PH"?
> 
> I have just been told by a user in the Philippines that the theoretical
> distinction between Tagalog and Filipino is rarely observed, even by
> users in the native Tagalog community: nobody today seems to speak a
> "pure" Tagalog language, so most computer applications simply do not
> make the distinction.
> 
> This means that for locale designation in applications, they almost
> always refer to the "Filipino" language as a synonym of Tagalog, and
> they most often don't use the new "fil" code of ISO 639-2 assigned to
> Filipino (and incorrectly unified with Pilipino for terminological
> purposes).
> 
> So it seems that Tagalog should be coded this way in ISO 639 (or, more
> exactly, applications should behave as if it were coded like this):
> 
> - English name: Tagalog (modern); aliases Filipino, Pilipino
> - French name: Tagalog (modern); alias Philippin
> - 2-letter code in ISO 639-1: tl
> - 3-letter codes in ISO 639-2 (B/T): tlg/fil
> 
> i.e. the "fil" code should be considered the terminology code, "tlg"
> used for bibliographic classification, and "tl" used in locale data
> (assuming the Latin script)...
> 
> A best-match locale code will then be "tl" or "tl-PH". Historic "pure"
> Tagalog texts written with the Tagalog script should be tagged with the
> locale identifiers "tl-Tglg" or "tl-PH-Tglg" (by adding the capitalized
> 4-letter ISO 15924 script code).
> 
> Are there other opinions about this?
> 
> 





RE: Ligatures

2004-11-27 Thread Addison Phillips [wM]
I suppose one could construct such a list, but using them to encode text is a 
Very Bad Idea. It is better, for example, to encode the "fi" ligature as the 
letter "f" followed by the letter "i" and let rendering software, fonts, and so 
forth provide the ligature. Encoding ligatures directly will make your life 
harder. For example, most spell checkers will fail the word "final" when it is 
spelled U+FB01 U+006E U+0061 U+006C (that is, fi-ligature followed by "nal"). 
If you are constructing a font, there are lots of good links on the Unicode 
website which include information on how to handle ligation without having a 
code point for every combination of characters you ligate.
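
To see why the spell checker fails, and how compatibility normalization
repairs such text (a minimal sketch in Java, my illustration rather than
anything from the original mail):

    import java.text.Normalizer;

    public class LigatureDemo {
        public static void main(String[] args) {
            String word = "\uFB01nal"; // fi-ligature followed by "nal"
            System.out.println(word.equals("final")); // false: code points differ
            // NFKC applies U+FB01's compatibility decomposition to "fi".
            String folded = Normalizer.normalize(word, Normalizer.Form.NFKC);
            System.out.println(folded.equals("final")); // true
        }
    }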

I haven't time to write a good quality response right now, but no doubt someone 
will jump in with 37 pages of text about the small amount I've already written 
(please excuse my sarcasm, which isn't directed at you).

PS> Flarn isn't the reference I think it is, is it?

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] Behalf Of Flarn
> Sent: 2004-11-27 15:46
> To: [EMAIL PROTECTED]
> Subject: Ligatures
> 
> 
> Can you please give me a list of all the ligatures available? Thanks!
> 
> - Michael Norton (a.k.a. Flarn)
> E-mail address: [EMAIL PROTECTED]
> 





RE: (base as a combing char)

2004-11-27 Thread Addison Phillips [wM]
Dear Flarn,

I'm not sure exactly what you're trying to do: your question is a bit sketchy.

Most Unicode characters are "base" characters, in that they do not combine with 
the character preceding them in the character stream based on their character 
properties. Some characters are combining marks. Combining marks generally are 
not used (and thus not rendered) by themselves. They exist to modify base 
characters.

But combining marks are not the only way that a sequence of characters can form 
a "grapheme" (a visual unit of text). Ligatures, for example, are a sequence of 
base characters that form a grapheme. Some languages treat a sequence of base 
characters as a single letter. For example, Dutch sometimes treats the sequence 
"ij" as a single letter (it turns out that there are characters for the letter 
'ij' in Unicode too, but they are for compatibility with an ancient non-Unicode 
character set). Software must be modified or tailored to provide behavior 
consistent with the specific language and context.

For example, when you see the "fi" ligature in English, you naturally expect 
that the two letters "f" and "i" should be treated as separate letters--you 
should be able to place the cursor between them and delete one of them, for 
example. In another language you might find a letter that consists of two base 
characters where you should NOT be able to do that. Similarly, when one has the 
letter 'u' followed by a combining dieresis mark (U+0308), one expects the pair 
of logical characters to behave as a single character--ü (which is also 
encoded as a precomposed character, U+00FC).
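
A minimal Java sketch (my illustration, not from the original mail) showing
that the combining sequence and the precomposed character are canonically
equivalent, which is why software should treat them as the same "character":

    import java.text.Normalizer;

    public class CombiningDemo {
        public static void main(String[] args) {
            String decomposed = "u\u0308"; // 'u' + U+0308 COMBINING DIAERESIS
            String precomposed = "\u00FC"; // U+00FC, the precomposed form
            System.out.println(decomposed.equals(precomposed)); // false: raw code points differ
            // Canonical composition (NFC) folds the pair into the single character.
            String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
            System.out.println(nfc.equals(precomposed)); // true
        }
    }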

So: what application do you have for two base characters treated as a combining 
mark? Then list members might be able to comment with precision on your 
request.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] Behalf Of Flarn
> Sent: 2004-11-26 8:49
> To: [EMAIL PROTECTED]
> Subject: 
> 
> 
> I know that there are some combining characters, and a lot of base 
> characters. But, is there any way to use a base character as a 
> combining character? Please help me!
> 
> - Michael Norton (a.k.a. Flarn)
> E-mail address: [EMAIL PROTECTED]
> 





RE: Shift-JIS conversion.

2004-11-25 Thread Addison Phillips [wM]


Dear Pragati,

You can write your own conversion, of course. The mapping tables between 
Unicode and Shift-JIS are readily available. You should note that there are 
several vendor-specific variations in the mapping tables. Notably, Microsoft 
code page 932, which is often called Shift-JIS, has more characters in its 
character set than "standard" Shift-JIS (and it maps a few characters 
differently too...).

The important fact that you should be aware of: Shift-JIS is an encoding of 
the JIS X 0208 character set. UTF-8 is an encoding of the Unicode character 
set. The Shift-JIS encoding contains about 9,000 characters. Unicode (and 
thus its encoding UTF-8) contains about 93,000 characters. There will not be 
a mapping to Shift-JIS for every Unicode character.
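
For instance, here is a minimal Java sketch (my illustration, not from the
original mail) of such a conversion with the unmappable-character policy made
explicit; "windows-31j" is the code page 932 variant mentioned above:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.*;

    public class ToShiftJis {
        public static void main(String[] args) throws CharacterCodingException {
            byte[] utf8 = {(byte) 0xE6, (byte) 0x97, (byte) 0xA5}; // U+65E5 in UTF-8
            String text = new String(utf8, StandardCharsets.UTF_8);
            CharsetEncoder enc = Charset.forName("windows-31j").newEncoder()
                    // Characters with no Shift-JIS mapping become '?' instead of failing.
                    .onUnmappableCharacter(CodingErrorAction.REPLACE);
            ByteBuffer sjis = enc.encode(CharBuffer.wrap(text));
            while (sjis.hasRemaining()) {
                System.out.printf("%02X ", sjis.get()); // prints "93 FA" for U+65E5
            }
        }
    }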
 
Hope that helps.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of pragati
> Sent: 2004-11-24 21:01
> To: [EMAIL PROTECTED]
> Subject: Shift-JIS conversion.
> 
> Hello,
> 
> Can anyone please tell me how to convert from UTF-8 to Shift-JIS?
> Please let me know if there is any formula to do it other than using
> readymade functions as provided by Perl, because these functions do not
> provide mapping for all characters.
> 
> Warm Regards,
> Pragati Desai.
> 
> Cybage Software Private Ltd.
> ph(0)- 020-4044700 Extn: 302
> mailto: [EMAIL PROTECTED]


RE: My Querry

2004-11-23 Thread Addison Phillips [wM]

Hi Mike,

You misread my sentence, I think. I did NOT say that C language strings 
are compatible with UTF-8, but rather that UTF-8 was designed with 
compatibility with C language "strings" (char*) in mind. The point of UTF-8 
was actually to be compatible with Unix file systems, of course. But one 
stimulus for the encoding was so that the Plan 9 operating system wouldn't 
have to rewrite the C libraries to deal with UTF-16 (then UCS-2). In other 
words, my statement is quite correct about the design goals of FSS-UTF, 
UTF-8's progenitor. See for example:

http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt

If you read carefully, you'll see the desire to protect the NUL and '/' 
bytes.

A NULL character is considered to terminate a char* by many C functions. I 
don't see how it helps anything to confuse a new user by bringing up the 
fact that you can't put a NULL character into the middle of a char*. This, 
as you point out, applies equally to ASCII data.

Java's TES was designed to transport Java java.lang.String objects in a C 
char*. Java strings can contain the character U+0000, and Java's developers 
wished to allow this character in the middle of a java.lang.String. Hence 
this bit of fudge.
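
That bit of fudge is easy to observe (a sketch assuming the JDK's
DataOutputStream, which writes Java's modified UTF-8): U+0000 comes out as
the two bytes C0 80, so no zero byte ever appears inside the char*:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ModifiedUtf8Demo {
        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            new DataOutputStream(bos).writeUTF("a\u0000b");
            // After the two-byte length prefix: 61 C0 80 62.
            // U+0000 is encoded as C0 80, never as a raw zero byte.
            for (byte b : bos.toByteArray()) {
                System.out.printf("%02X ", b); // prints "00 04 61 C0 80 62"
            }
        }
    }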

When talking to a newbie I purposely omitted all of these glorious but 
pointless details. The point is that UTF-8 can go into your char* just like 
any other multibyte encoding, in contrast with the myth that char* and 
Unicode cannot mix. 

Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: Mike Ayers [mailto:[EMAIL PROTECTED]]
> Sent: 2004-11-23 10:32
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: RE: My Querry
> 
> > From: [EMAIL PROTECTED]
> > [mailto:[EMAIL PROTECTED]] On Behalf Of Addison Phillips [wM]
> > Sent: Tuesday, November 23, 2004 9:14 AM
> > 
> > One of the nice things about UTF-8 is that the ASCII bytes
> > from 0 to 7F hex (including the C0 control characters from
> > \x00 through \x1f---including NULL) represent the ASCII
> > characters from 0 to 7F hex.
> 
> Correct.
> 
> > That is, among other things,
> > UTF-8 was designed specifically to be compatible with C
> > language strings.
> 
> Wrong! Weren't you paying attention last week? C language strings are
> not even fully compatible with ASCII. UTF-8 is fully compatible with
> ASCII, therefore C language strings are not fully compatible with
> UTF-8. The Java folks devised a TES, which was UTF-8 with one change
> (and therefore no longer UTF-8), which was "designed specifically to
> be compatible with C language strings". This method apparently upsets
> some people.
> 
> Since the problem between C strings and ASCII/UTF-8/(your character
> set here) is solely the inability to handle zero-valued character
> elements, it may be, and very often is, practical to use C strings
> anyway, as zero-valued characters are uncommon at best in practice,
> and explicitly disallowed in many applications.
> 
> /|/|ike

RE: My Querry

2004-11-23 Thread Addison Phillips [wM]
If you are writing a C program, then the null character can be used to indicate 
the end of a string.

One of the nice things about UTF-8 is that the ASCII bytes from 0 to 7F hex 
(including the C0 control characters from \x00 through \x1f---including NULL) 
represent the ASCII characters from 0 to 7F hex. That is, among other things, 
UTF-8 was designed specifically to be compatible with C language strings.
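
A small Java sketch (my illustration) of the property being described: every
byte of a multibyte UTF-8 sequence has the high bit set, so a zero byte can
only ever be a real U+0000, and null-terminated handling still works:

    import java.nio.charset.StandardCharsets;

    public class Utf8AsciiDemo {
        public static void main(String[] args) {
            // 'A' stays a single ASCII byte; the accented letter becomes C3 A9.
            for (byte b : "A\u00E9".getBytes(StandardCharsets.UTF_8)) {
                System.out.printf("%02X ", b); // prints "41 C3 A9"
            }
            // No byte in the range 00-7F ever occurs inside a multibyte
            // sequence, so a strlen()-style scan finds the true end of string.
        }
    }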

Addison

Addison P. Phillips
Director, Globalization Architecture
http://www.webMethods.com

Chair, W3C Internationalization Working Group
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] Behalf Of Harshal Trivedi
> Sent: 2004-11-23 3:42
> To: [EMAIL PROTECTED]
> Subject: My Querry
> 
> 
> How can I make sure that a UTF-8 format string has terminated while
> encoding it, as compared to a C program string, which ends with the
> '\0' (NULL) character?
> 
> -> Is there any special symbol or procedure to determine the end of a
> UTF-8 string, or is the ASCII NULL '\0' used as-is to indicate that?
> 
> -- 
> Harshal P. Trivedi
> Software Engineer





RE: valid characters in user names- esp. compatibility characters

2004-08-11 Thread Addison Phillips [wM]
Hi Tex,

webMethods has used a (slightly modified) version of punycode for handling generated 
class names in Java in several products, very successfully, for several years now. The 
slight modification is to substitute the underscore for the dash character (since the 
dash is illegal in Java class names). Punycode has proven to be exceedingly robust for 
this type of application, although the algorithm is very arcane. 

Our ACE coder doesn't directly impose NFKC or any of the stringprep-type preparations. 
In our application of ACEs, users create objects visually and we generate Java code 
named after the objects in a process invisible to users. Although NFKC and stringprep 
are reasonable restrictions for IDN, with its peculiar requirements, it doesn't follow 
that they are good for all applications. Punycode (and all other ACEs) are essentially 
transfer encoding schemes for Unicode code points. The ASCII sequences they generate 
are unique for any particular Unicode scalar sequence. 

It's true that logins have many similarities to IDN in terms of requirements, though. 
Just note that there is no reason why an internal algorithm *has* to do both 
stringprep and punycode or has to do stringprep in the IDN way...
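
The JDK has no standalone punycode API, but java.net.IDN (added in Java 6,
well after this mail; it does apply IDNA's stringprep step, unlike the
webMethods scheme described above) can at least illustrate the ACE round
trip and the underscore substitution. This is my sketch, not webMethods code:

    import java.net.IDN;

    public class AceDemo {
        public static void main(String[] args) {
            // IDNA's ASCII-compatible form of a non-ASCII label (RFC 3492 punycode).
            String ace = IDN.toASCII("m\u00FCnchen"); // "xn--mnchen-3ya"
            // Substituting '_' for '-' yields a legal Java identifier fragment,
            // in the spirit of the class-naming scheme described above.
            System.out.println(ace.replace('-', '_')); // "xn__mnchen_3ya"
        }
    }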

I have a whitepaper on the subject which expands (a tiny amount) on webMethods' use of 
ACEs; it was presented at IUCs twice, the last time at IUC 22, and is called "Four 
ACEs: A Survey of ASCII Compatible Encodings". The PDF is on my personal website, 
http://www.inter-locale.com. I can't remember, but I think this one was a substitute 
paper at IUC 22, so it probably isn't in the program proceedings.

Hope this helps,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] Behalf Of Tex Texin
> Sent: 2004-08-11 18:29
> To: Unicoders
> Subject: valid characters in user names- esp. compatibility characters
> 
> 
> hi,
> 
> 1) I am looking at a set of legacy applications that would like to
> extend user IDs to support international characters.
> It is not possible to update all of the applications simultaneously to
> fully support Unicode, so I am considering an algorithmic mapping of
> the international IDs to an ASCII-based encoding and a layering
> similar to how domain names were extended to be international.
> 
> However, I am curious as to whether some users might read/write their
> names using compatibility characters (esp. in ideographic markets) and
> object to the characters being normalized through NFKC. I thought it
> might be like someone spelling their name incorrectly. I don't know
> enough about ideographic names or the compatibility characters to
> evaluate whether it would be perceived as a problem by users. If any
> CJK experts would comment on this, it would be appreciated.
> 
> 2) I am also getting questions about the robustness and stability of
> the GNU libidn implementations of stringprep and punycode, which are
> being considered. I would be glad to hear privately if you have used
> them and what your experience was/is.
> 
> tia
> tex
> 
> -- 
> -
> Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
> Xen Master  http://www.i18nGuy.com
>  
> XenCraft  http://www.XenCraft.com
> Making e-Business Work Around the World
> -
> 





W3C I18N rechartering: input invited

2004-07-15 Thread Addison Phillips [wM]
Dear Unicoders:

The W3C Internationalization Activity is preparing new Working Group charters in order 
to continue and extend its activities. The goal of the W3C in this area, according to 
the current activity statement, is: "to propose and coordinate the adoption by the W3C 
of techniques, conventions, technologies, and guidelines that enable and enhance the 
use of W3C technology and the Web worldwide with and between the various different 
languages, scripts, regions, and cultures." The new charters are in an early draft 
stage. Ultimately these charters must be reviewed and approved by the W3C Management, 
W3C Advisory Committee representatives, and the Director.

This is an open letter inviting feedback on the charters, including areas that are not 
listed that need W3C attention, the proposed working groups, and the work that you 
would contribute on these projects. Draft versions of the new charters are available 
at the links at the bottom of this message. Links to the current charter and activity 
statement, for comparison, are also provided there. 

In the past, the Activity has been organized as a single Working Group, with various 
task forces that focused on special projects. The draft charters propose two Working 
Groups. The "Internationalization Core Working Group" (Core) will continue its focus 
on developing standards in the area of internationalization and reviewing W3C and 
related technologies for sound international design and support. The 
"Internationalization Guidelines, Education, and Outreach Working Group" (GEO) will 
focus on making the internationalization aspects of W3C technology better understood 
and more widely and consistently used.

Below you will find a summary of the items currently being considered for the Working 
Group charters. Feedback, suggestions and comments are welcome (please use the [EMAIL 
PROTECTED] list for public feedback). Member-only or team-only comments should be 
directed to Martin Dürst or Richard Ishida.

Best Regards,

Addison Phillips, Chair W3C-I18N-WG
Martin Dürst, I18N Activity Lead
Richard Ishida, Staff Contact, I18N WG

- New Core Work -
The core working group's draft charter includes some new work items. These are 
generally related to the findings of the current Web Services Task Force:

1. WS-International. Develop a WS-* specification for describing in WSDL and for 
exchange in SOAP messages the international context and preferences necessary to 
enable internationalized Web services.

Deliverable: Develop a specification and promote to Candidate Recommendation status.

2. Locale and Language Identifiers on the Web. Recent IETF activity is in the process 
of replacing the existing standard for language identifiers (RFC 3066), which has been 
widely cited in W3C specifications (HTML lang, XML xml:lang, XML Schema, XML Query, 
RDF, and many others). The changes in the RFC on which these are based require that 
policies and guidelines be developed for current and future specifications.  In 
addition, in order to enable multi-locale operation of Web services and to create a 
locale negotiation layer for them, there needs to be a standardized method for 
identifying locales and locale preferences on the Internet.

Deliverable: Develop a guidelines document for incorporating the RFC 3066 replacement 
into W3C technologies to be published as a Note. 
Deliverable: Develop a specification for locale identification on the Web and promote 
to Candidate Recommendation status.

3. Collation Identifiers. W3C technologies have struggled with the problem of 
collation, sorting, and ordering on the Web. These efforts have been hampered by lack 
of a clear, open, consistent, and standardized way of referring to collation 
sequences. Efforts to produce an Internet-Draft need to be re-energized.

Deliverable: Coordinate efforts to develop an Internet-Draft and submit to the IESG on 
the IETF Standards Track.


- Continuing Core Work -
The core working group's draft charter includes these items of continuing work:

1. Achieve 'Candidate Recommendation' status for the 'Character Model for the World 
Wide Web 1.0: Normalization' (CharModNorm)
2. Achieve 'Proposed Standard' status at the IETF for 'Internationalized Resource 
Identifiers' (IRIs)
3. Finalize 'PR' status for the 'Character Model for the World Wide Web 1.0: 
Fundamentals' (if required: we are approaching this stage now)
4. Review of new W3C specifications and technologies.
5. Liaise with other standards organizations as appropriate.

- GEO Work -

The GEO working group's charter is more outwardly focused and includes:

1. Web Resources. The WG will continue to publish material on the 
http://www.w3.org/International/ site to assist users of W3C and related Web 
technologies to internationalize their approach.

2. Techniques. The WG will build on existing work to create a series of documents 
providing internationalization-related advice for users of Web technologies. Thes

RE: number of bytes for simplified chinese

2004-06-28 Thread Addison Phillips [wM]


Hi Duraivel,
 
Your question is incomplete. There are several Unicode encodings to choose 
from, and the "number of bytes" question is influenced by your choice of 
encoding, as well as by the data you choose.

For example, UTF-8 is a multibyte encoding of Unicode, in which each 
character is 1, 2, 3, or 4 bytes long, depending on the character. The 
majority of characters written in Simplified Chinese will be three bytes 
long in this encoding.

UTF-16 encodes characters using two bytes per character for the vast 
majority of characters in most sets of data. Some Chinese characters are 
encoded on the higher (or "supplementary") planes of Unicode and require 
two two-byte code units (a "surrogate pair") to access them in UTF-16. 
These characters are generally considered quite rare in "average" data, and 
it is unlikely that your data will contain more than a few of them in any 
event.
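
A minimal Java sketch (my illustration, not from the original mail) of both
points: a common Simplified Chinese character takes three bytes in UTF-8 and
one UTF-16 code unit, while a supplementary-plane character takes four bytes
and a surrogate pair:

    import java.nio.charset.StandardCharsets;

    public class CjkByteCounts {
        public static void main(String[] args) {
            String common = "\u6C49";     // U+6C49, a common Simplified Chinese character
            String rare = "\uD840\uDC00"; // U+20000, CJK Extension B (supplementary plane)
            System.out.println(common.getBytes(StandardCharsets.UTF_8).length); // 3
            System.out.println(common.length());                                // 1 UTF-16 code unit
            System.out.println(rare.getBytes(StandardCharsets.UTF_8).length);   // 4
            System.out.println(rare.length());                                  // 2 (a surrogate pair)
        }
    }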
 
Probably, though, you are not starting your question in the right place. 
Why do you care about the number of bytes in a character? The reasons you 
give will determine whether a specific encoding is more (or less) suited 
for use than another encoding (or even another character set, such as a 
legacy, non-Unicode character set/encoding). For example, if you are trying 
to determine whether Unicode is more (or less) efficient than a legacy 
solution, then I think you'll find that the performance issues are 
somewhere other than the average byte count per character. If you are 
worried about storage (disk, database, etc.), then the specifics of your 
situation will determine what the "right answer" may be for you.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Duraivel
> Sent: 2004-06-27 23:38
> To: [EMAIL PROTECTED]
> Subject: number of bytes for simplified chinese
> 
> hi,
> 
> I would like to know the number of bytes required for the Simplified
> Chinese language. Can we represent all the characters of Simplified
> Chinese in Unicode using just two bytes?
> 
> regards
> duraivel


RE: Latin Characters

2004-06-24 Thread Addison Phillips [wM]



Philippe wrote:

> In Windows standard keyboard drivers, there's no integrated support to
> enter Unicode codepoints transparently in all applications.
> I wonder if there exists a keyboard driver that would support
> ALT+NumPadPlus followed by 1 to 4 hex digits to enter at least UTF-16
> code points (possibly extended to enter 5 or 6 hex digits for easy
> access to supplementary planes, by splitting the supplementary code
> point internally into surrogates when feeding applications which will
> only treat surrogates).

The "Chinese (Taiwan) - Chinese Traditional Unicode" keyboard allows you to 
type UTF-16 hex sequences into ANY Windows application transparently. It's 
available on Windows 2000 and later. Supplementary characters have to be 
entered as surrogate pairs, of course. It's easy to install from the 
Regional Options control panel. You can find instructions in my little 
document "Learn to Type Japanese and Other Languages" here:

http://www.inter-locale.com/whitepaper/LearnToType.pdf (Acrobat PDF format)

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of
> Philippe VERDY
> Sent: 2004-06-24 7:36
> To: Joe Speroni; [EMAIL PROTECTED]
> Subject: Re: Latin Characters
> 
> "Joe Speroni" wrote:
> > I was reminded that ALT+UNICODE (decimal) inputs characters into
> > Windows (...)
> 
> Wrong statement here: ALT+decimal is supported in keyboard drivers,
> but it is limited to entering ONLY decimal values between 1 and 255.
> This value can be entered and will be interpreted in two ways:
> 
> * The decimal code specified is not the Unicode code point, but a code
> in the OEM charset supported by your selected keyboard driver
> (typically codepage 437 in the US, 850 in Western Europe). This is
> provided for compatibility with Console apps emulating DOS that use
> the currently selected DOS codepage (see the CHCP command, which
> allows setting the codepage used in the Console; other GUI Windows
> apps use the currently selected OEM codepage).
> 
> * If you prefix the decimal code by entering a zero digit key, the
> decimal value is interpreted by your keyboard driver as a code in the
> local "ANSI" codepage of your OS (this codepage cannot be changed at
> run-time like the OEM codepage, and is fixed in your local Windows
> installation). This code is then NOT Unicode (it is most often Windows
> 1252). The keyboard driver interprets the decimal value in the Windows
> ANSI codepage and converts it to Unicode. This makes a difference for
> all "ANSI" codes entered with ALT+0128 to ALT+0160 (which get mapped
> to Unicode code points above 255, according to the definition of the
> Windows ANSI codepage), and if your ANSI codepage is not Windows 1252
> (Western European localizations of Windows) with ALT+0161 to ALT+0255
> (which will not be converted identically to Unicode code points by
> your Windows keyboard driver's keymap).
> 
> > and discovered UNICODE (hex) + ALT-X inputs into Word.
> 
> Not a Windows standard; this is just an enhanced keyboard function
> which treats ALT+X specially in Word. Windows itself does not have
> this feature.
> 
> In Windows standard keyboard drivers, there's no integrated support to
> enter Unicode codepoints transparently in all applications.
> 
> I wonder if there exists a keyboard driver that would support
> ALT+NumPadPlus followed by 1 to 4 hex digits to enter at least UTF-16
> code points (possibly extended to enter 5 or 6 hex digits for easy
> access to supplementary planes, by splitting the supplementary code
> point internally into surrogates when feeding applications which will
> only treat surrogates).
> 
> This could be developed quite easily as a separate keyboard driver
> (easily changeable by users in Windows XP, which supports multiple
> language/keymap combinations). But each application has its own use of
> keyboards, and generally, Windows only provides ALT+decimal and
> ALT+0+decimal for compatibility with legacy DOS and Windows 3.x
> applications, leaving all extra combinations out of the "standard" key
> mappings (each application may have its own existing keyboard
> shortcuts, which might stop working if this were done by default in
> keyboard drivers).
> 
> Instead of forcing new key combinations for easier keyboards, it seems
> that Microsoft has promoted the definition and use of supplementary
> keys on keyboards to allow system shortcuts (for example the newer
> Windows, Menu, and Sleep/WakeUp keys, or Internet navigation keys,
> Media Player controls...) without impacting the existing key mappings
> for language/plain-text use.
> 
> It's noticeable that you'll find some other keys on all notebooks that
> don't exist

RE: Latin long vowels

2004-06-22 Thread Addison Phillips [wM]


The characters you mention exist in Unicode.

They are:

U+014D
U+0113
U+0101
U+012B
U+016B

(Those are the lowercase letters; the uppercase versions are 1 less than 
the lowercase, so capital O with macron is U+014C.) I've typed them into 
this message so you can play with fonts:

ōēāīū
ŌĒĀĪŪ

These are all in the Latin Extended-A block. See p. 167 of Unicode 4.0. 
Many fonts include these characters. Since you mention Windows XP: if you 
have Microsoft Office, you can install the Arial Unicode MS font, which 
contains nearly all of the characters in the first plane of Unicode. If 
you want to find other fonts, you can use the Character Map utility to 
explore the character ranges supplied by various fonts on your system.
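
A throwaway Java sketch (mine, not from the mail) that prints the five
pairs, relying on the uppercase-is-one-less relationship noted above:

    public class MacronVowels {
        public static void main(String[] args) {
            int[] lower = {0x014D, 0x0113, 0x0101, 0x012B, 0x016B}; // ō ē ā ī ū
            for (int cp : lower) {
                // For these Latin Extended-A letters the uppercase code
                // point is exactly one less than the lowercase one.
                System.out.printf("U+%04X %c   U+%04X %c%n",
                        cp, (char) cp, cp - 1, (char) (cp - 1));
            }
        }
    }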
 
Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Joe
> Speroni
> Sent: 2004-06-22 2:10
> To: [EMAIL PROTECTED]
> Subject: Latin long vowels
> 
> I apologize for a simple question, but after a few hours of "research"
> I don't seem to be able to find the characters needed. I'm trying to
> scan a Latin text that uses a bar over the vowels to indicate long
> sounds. Do these characters exist in Unicode?
> 
> If so, would anyone know from where a Windows XP font containing these
> five characters could be downloaded?
> 
> Again, sorry for such a simple question when I know more weighty
> questions are addressed in this forum.
> 
> Aloha,
> 
> Joe Speroni
> Honolulu

RE: ISO 15924 draft fixes

2004-05-20 Thread Addison Phillips [wM]
I don't care about the order, so long as it is stable over time. Personally I find the 
latter form more logical (with the identifier, i.e. the code, first). I view the 
English and French names and the "PVA" as merely descriptive or informative 
information. The code and the ID number should go first, IMO.

But if the file is in some other format, that's fine, so long as the format is stable.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] Behalf Of Michael Everson
> Sent: 2004-05-20 10:59
> To: [EMAIL PROTECTED]
> Subject: RE: ISO 15924 draft fixes
> 
> 
> At 10:00 -0700 2004-05-20, Addison Phillips [wM] wrote:
> >I concur with Peter. If there are multiple 
> >documents now, then I'd like to see a single 
> >normative document...
> 
> It will be the plain-text version, and for the 
> purposes of fixing the current regrettable mess 
> I'm taking it as read that the plain text version 
> was always the normative version.
> 
> >and furthermore I would like it to *be* 
> >normative (and I'd like to know which one it 
> >is). The text file is listed on the web site as 
> >the "alternative"...
> 
> It should say normative.
> 
> Is the format order satisfactory? 
> English_Name;Code;N°;Nom_français;PVA;Date
> Or would it be preferable to have it in the 
> format of Table 1 
> (Code;N°;English_Name;Nom_français;PVA;Date)
> 
> 
> -- 
> Michael Everson * * Everson Typography *  * http://www.evertype.com
> 





RE: ISO 15924 draft fixes

2004-05-20 Thread Addison Phillips [wM]
I concur with Peter. If there are multiple documents now, then I'd like to see a 
single normative document... and furthermore I would like it to *be* normative (and 
I'd like to know which one it is). The text file is listed on the web site as the 
"alternative"...

By all means correct errors. Spelling or nomenclatural (non-substantive) changes in 
the descriptions are errata. But I view changes, additions, and deletions to/from the 
data tables as changes to the standard and they should, in my opinion, be treated as 
such even if they are only to correct errors.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] Behalf Of Peter Constable
> Sent: 2004-05-20 8:10
> To: Unicode List
> Subject: RE: ISO 15924 draft fixes
> 
> 
> > >For now I suggest an immediate warning in the ISO15924 web pages,
> > >explicitly stating that these published tables were in beta, and
> > >contain incoherences, which are being corrected.
> > 
> > No. This is purely cosmetic. Let us move on.
> 
> I find this cavalier attitude a bit disconcerting. Errors in the tables
> are not purely cosmetic. An IT standard is created to support IT
> implementations, and people have been and will be referring to those
> tables to create their implementations. Each view of the data should be
> reliable, and if it is found that it was not, then that needs to be
> communicated in some way. 
> 
> IMO, it is essential that there be a place on the site for errata. I'm
> inclined to agree with Philippe: the errata notes should indicate that
> there were errors in the original tables and what the nature of those
> errors were. If IDs were misspelled or missing, those should be
> enumerated. If English or French names were misspelled, I think a
> general note is sufficient.
> 
> 
> > >A link should list the incoherences and the proposed changes. I have
> > >such a list and all it takes for me is a simple Excel spreadsheet,
> > >used to sort the tables and detecting differences between published
> > >tables and proposed corrections.
> > 
> > The only delta we are going to deal with is the one between the
> > plain-text documents; it is that which is going to be considered
> > authoritative
> 
> Is that document*s* (plural)? I strongly encourage you to maintain *one*
> master source from which all others are derived.
> 
> 
> 
> Peter
>  
> Peter Constable
> Globalization Infrastructure and Font Technologies
> Microsoft Windows Division
> 





RE: TR35

2004-05-13 Thread Addison Phillips [wM]
Interestingly, the W3C I18N WG published a new working draft of our Web services 
scenarios document just yesterday, and some of that document grapples with this 
issue--when and how to exchange locale information and other "international 
preferences", as well as when and how to exchange language information. The document 
is here:

http://www.w3.org/TR/2004/WD-ws-i18n-scenarios-20040512/

I think what's interesting is that our document illustrates some of the situations in 
which you might wish to exchange locale information. And I think these illustrations 
go more to prove Peter's point than not. Locale interchange is very important to 
internationalized software. Certainly language tags carry or imply locale information 
in certain situations. Although the concepts are related, it needs to be very clear 
just how much information one can infer from a language tag.

For example, if you read XSLT (see: http://www.w3.org/TR/xslt#convert) and think that 
the "lang" attribute for converting numbers to strings is a locale, then you probably 
haven't read the text closely enough. It really means something more like language. (I 
think this particular example illustrates nicely just how fuzzy the edges are.)

Antoine Leca's example is a good one (there is a similar one in the document above, 
donated by Mark Davis), and I think it shows how distributed software needs to have 
locale information in order to produce results that one could deem "correct" (if that 
text were generated by a message formatter, for example). But we shouldn't confuse 
language tagging of the result ("english") with software processing used to produce it 
(that sentence might have been rendered in the locale "de-DE").

So, there are very valid reasons why applications need to transfer locale preferences. 
Check out our group's document (and the forthcoming requirements document) and see if 
you don't agree... but we should be wary of very broad global statements (both "all 
language tags are also locale tags" and "language tags are never locale tags").

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED] Behalf Of Peter Constable
> Sent: 2004-05-13 7:40
> To: Unicode Mailing List
> Subject: RE: TR35
> 
> 
> > Well, it is true that what I really search for is not *exactly* the
> > formatting locale, but rather another wider information, which would
> be the
> > mind setting of the writer.
> 
> Precisely. The locale info only tells you how a number would have been
> formatted by the author's system, not what the author in fact did. When
> you receive a document, being told what the system would have done
> doesn't tell you anything useful. Not unless the document you receive
> was generated by the system -- and I'm guessing that in many such
> situations what's exchanged isn't a document per se but data structures
> in which numbers are in some pre-defined representation not formatted
> for the user.
> 
> I'm not saying that there is never a need to exchange locale-setting
> info. Only that I don't think it's appropriate in general to tag
> documents (by which I don't mean an accounting spreadsheet or an
> order-entry record) for things like number formatting, and so such info
> should not be included in attributes like xml:lang.
> 
> 
> > I have another example, but I cannot expose it here publicly, it is
> related
> > to some proprietary software.
> 
> If something is going on internal to proprietary software, then there
> are no rules. This is only about public interchange.
> 
> 
> 
> Peter
>  
> Peter Constable
> Globalization Infrastructure and Font Technologies
> Microsoft Windows Division
> 





RE: multibyte char display

2004-03-15 Thread Addison Phillips [wM]
Hi Manga,

There are two things that you need to check here.

First, is your environment set up to display the non-ASCII characters?
Solaris offers an impressive array of UTF-8 locales which should allow you
to view Unicode data. You can switch to one of these by setting your LANG
environment variable (although this may not solve font problems and other
issues). Use the command 'locale -a' to list the available locales on your
machine and look for one that looks like (for example) 'en_US.UTF-8'. [You
may also be able to use a locale compatible with your data. An EUC-JP
locale, for example, will display Japanese characters on the console.]

Note that changing your locale on Unix isn't the whole solution. You may
have to install fonts appropriate for the language/data (otherwise you'll
see hollow boxes instead of question marks).

Be sure to set LANG before running your Java program. For example (where
'%' is the shell prompt):

   % LANG=en_US.utf8 java -cp ...

The second issue you may encounter: is your data actually making it into the
database? If your database is not configured to use a Unicode encoding (or
at least a multibyte encoding compatible with your data), then the question
marks are being created by the database when you store the data originally.

How database encodings are configured and how you retrieve that information
varies by database. I have a whitepaper at
http://www.inter-locale.com/IUC19.pdf (which is rather stale, but has some
useful information). You might check in your Java program to see if you are
getting question marks in your Strings. This would indicate a problem with
the database or (rarely) the JDBC driver configuration.

Finally, you should check your code out. If you are just writing a little
console app and your database is correctly configured, the problem may just
be the locale and setup of your Solaris box as noted above. If you are
having problems with text files, you should check out your use of
OutputStreamWriter to ensure that you control the encoding it uses (and
don't use the default system encoding, which is affected by your runtime
locale). Writing out files as UTF-8 (instead of System.out.println()) will
let you use the native2ascii utility or other programs to investigate the
actual codepoints you are retrieving.
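
On that last point, a minimal sketch (an assumption of typical usage, not
code from the original mail) of pinning the output encoding on the writer
instead of inheriting the runtime locale's default:

    import java.io.FileOutputStream;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.nio.charset.StandardCharsets;

    public class ExplicitEncoding {
        public static void main(String[] args) throws Exception {
            // Explicit UTF-8: output no longer depends on LANG/default locale.
            try (Writer w = new OutputStreamWriter(
                    new FileOutputStream("out.txt"), StandardCharsets.UTF_8)) {
                w.write("\u65E5\u672C\u8A9E\n"); // "Japanese", in Japanese
            }
        }
    }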

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Behalf Of Manga
> Sent: Monday, March 15, 2004 07:08
> To: [EMAIL PROTECTED]
> Subject: multibyte char display
>
>
> I use UTF-8 encoding in Java code to store multibyte characters in
> the db. When I retrieve the multibyte characters from the db, I see
> "?" in place of the actual multibyte characters. I use the Solaris
> OS. Is there any environment variable which I can set to see the
> actual characters on my terminal window?
>
> Thanks
>
>




RE: German characters not correct in output webform

2004-01-13 Thread Addison Phillips [wM]
Title: German characters not correct in output webform


Hi Bert,
 
This is a common problem.

When you do a form submit (POST or GET of data to the server), the browser 
encodes the characters being sent using the character encoding of the page. 
In your case, from the examples you sent, this encoding is Unicode UTF-8. 
UTF-8 is a multibyte encoding of Unicode in which non-ASCII characters take 
two or more bytes. In this case, the German accented characters each take 
two bytes.

When the server receives the data, it decodes the original bytes sent by 
the browser. The problem is: what encoding should be used to interpret the 
bytes? For historical reasons, most Web servers (including J2EE, .NET, 
Apache/Tomcat, etc.) default to using ISO 8859-1 (Latin-1), a single-byte 
Western European encoding. This is what is happening in your case: each 
UTF-8 byte is being treated as a single character, leading to the 
corruption you are experiencing. You can see that each German character is 
interpreted as a sequence of two bytes.

To fix your problem you must change your server-side configuration to 
interpret the bytes sent using the same encoding that the form uses (UTF-8 
in this case). This has nothing to do with your JavaScript. What exactly to 
do depends on the technology of your web server. There are too many of 
these to list here, but you should be able to do a little searching to find 
the answer (or write back off-list and I can probably point you to the 
documentation).
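
What that looks like concretely varies by stack; for a Java servlet
container (an assumption about the poster's setup, offered purely as an
example), the fix is to declare the request encoding before reading any
parameter:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class FormServlet extends HttpServlet {
        @Override
        protected void doPost(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // Must run before getParameter(): tells the container the form
            // bytes are UTF-8 rather than the ISO-8859-1 default.
            req.setCharacterEncoding("UTF-8");
            String name = req.getParameter("name"); // now decoded correctly
            resp.setContentType("text/html; charset=UTF-8");
            resp.getWriter().println(name);
        }
    }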
 
You might want to be aware of the W3C's Internationalization mailing list 
(see http://lists.w3.org/Archives/Public/www-international/) and of the 
FAQs at http://www.w3.org/International/geo (alas, the FAQ on this topic 
hasn't been published yet!)

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Bert
> Kemner
> Sent: Monday, January 12, 2004 22:38
> To: '[EMAIL PROTECTED]'
> Subject: German characters not correct in output webform
> 
> Hi,
> 
> I've a problem with a JavaScript form on a German website
> (http://informationservices.swets.de/web/show/id=47553). The input of
> the form contains German characters. But the output (which is
> generated by submitting the form) does not display those characters
> (see example beneath). My first reaction to this problem is that
> Unicode somehow does not translate these German characters to Windows
> (Outlook).
> 
> Example form output:
> Form: Kontaktformular
> Sender:
> Receiver: [EMAIL PROTECTED]
> Insertdate: 2/12/2003
> 
> Vor- und Zuname:: Birgitta MÃÆÃÂhe
> Firma / Institution:: ÃÆ?ffentliche BÃÆÃÂcherei Mainz
> Berufsbezeichnung::
> E-Mail-Adresse:: [EMAIL PROTECTED]
> Telefonnummer::
> Ihre Fragen und Anregungen:: Wir interessieren uns fÃÆÃÂr eine
> Abonnement der Print-Ausgabe der britischen Tageszeitung "Times". Ist
> dies ÃÆÃÂber Sie mÃÆÃÂglich und wenn ja zu welchen Konditionen.
> (Preis, wann wird zugestellt? ...)
> 
> Can you help me with this, or suggest something?
> 
> I really appreciate your help.
> 
> Kindest regards,
> 
> Bert Kemner
> Webmaster, Swets Information Services
> P.O. Box 830, 2160 SZ Lisse / Heereweg 347B, 2161 CA Lisse
> The Netherlands
> T +31 (0)252 435 241 / F +31 (0)252 415 888
> E [EMAIL PROTECTED]
> www.swets.com


RE: Fonts on Web Pages

2003-12-02 Thread Addison Phillips [wM]
Um...

This specialist list discusses Unicode, which, last I looked, had something
to do with encoding characters. Of course both fonts and Web pages use
Unicode, but that doesn't mean that this list specializes in either. There
are other specialist lists that discuss Web pages, fonts and other such
arcana.

Before asking questions, it is usually a good idea to see if you can find
the answer somewhere. It turns out that the W3C does, in fact, maintain FAQs
and HTML authoring guidelines for international users under the auspices of
the Internationalization Working Group's GEO Task Force, which you can find
here:

http://www.w3.org/International/geo

Alas, the sections and/or FAQs that would answer your specific question
haven't been written yet. Perhaps, when you learn the answer, you might help
contribute to the edification of others?

There are also lists that specialize in answering questions about Web
technology. For example, [EMAIL PROTECTED]

Information on subscribing to that list is here:

http://www.w3.org/International/core

Finally, there was a discussion of this topic not that long ago on that
list. Here is the link to the first message in that thread in the list
archive:

http://lists.w3.org/Archives/Public/www-international/2003JulSep/0004.html

Your question may not be answered to your satisfaction, since the answers to
this question, like those of most internationalization questions, seem to
start with the phrase "Well, it depends...", but it helps to look in the
right places :-).

Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Arcane Jill
Sent: Tuesday, December 2, 2003 08:12
To: [EMAIL PROTECTED]
Subject: RE: Fonts on Web Pages



Aaargh! No it doesn't. PLEASE stop filling this thread with stuff which
does not address the original question. I am interested in WEB PAGES.
Nothing else. Not Acrobat files. Not Word files. Nothing. JUST WEB PAGES. If
you can't do it on a bog-standard HTML page, it's not answering the
question.

Frustrated with all these unrelated side-issues, I decided to try Google
instead. (Google often gives better answers about things than specialist
lists!) I found a really good demo of exactly what I was after at
http://www.truedoc.com/webpages/intro/. Of course, I still don't know if
this is state-of-the-art, or whether something better has turned up since
then.

If anyone has any further information about how to embed fonts in HTML
files, please let me know.
Thanks
Jill


-Original Message-
From: Raymond Mercier [mailto:[EMAIL PROTECTED]
Sent: Tuesday, December 02, 2003 12:51 PM
To: Arcane Jill
Cc: [EMAIL PROTECTED]
Subject: Re: Fonts on Web Pages


Of course Adobe was designed to address just the problem you defined,




OFF-TOPIC: Proposed Successor to RFC 3066 (language tags)

2003-11-19 Thread Addison Phillips [wM]
Hi Philippe,

Thanks for the note.

The announcement here was purely informational. This is off-topic for this list, and 
thus further comments really should be carried off to [EMAIL PROTECTED] Cross-posting 
with this list is a Bad Idea, IMHO. I have not cross-posted this note, to prevent any 
thread "over there" from escaping back to the Unicode list. I HAVE posted a response 
to your message to you privately, copying that list.

Thanks again for the comments.

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.

> -Original Message-
> From: Philippe Verdy [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, November 19, 2003 3:51 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Proposed Successor to RFC 3066 (language tags)
> 
> 
> From: Addison Phillips [wM]
> > Please note that there is a discussion list for this topic at:
> [EMAIL PROTECTED]
> >
> > While Mark and I welcome your comments here or privately, off-list, you
> can best be
> > a part of the discussion by joining that list. Join the list by 
> sending a
> request email
> > to:  [EMAIL PROTECTED]
> 
> I note that the language tags proposal includes the following EBNF
> productions for extensions that may be padded after the language code,
> script code, region code, variant code:
> 
> extensions  = "-x" 1*("-" key "=" value)
> key         = ALPHA *alphanum
> value       = 1*utf8uri
> alphanum    = (ALPHA / DIGIT)
> utf8uri     = (ALPHA / DIGIT / 1*4("%" 2HEXDIG))
> 
> Under this new scheme, the following language tag may be valid:
> "sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0"
> which here would mean: {
> language="sr"; // Serbian
> script="Latn"; // Latin
> region="SP"; // Serbia-Montenegro
> variant="2003";
> extensions="-x"; {
> href="http://www.iana.org/";
> version="1.0"
> }
> }
> 
> However the problem with that scheme is its new use of the characters
> "%" and "=". There are a lot of applications that were not expecting
> anything other than alphanumerics, "-", "_", or "." in this field, so
> that the language tag could safely be used without specific escaping
> within URIs (for example in HTTP GET URLs) or as options of a MIME
> type (I take a sample here, which may not correspond to an existing
> option of the "text/plain" MIME type):
> 
> Content-Encoding: text/plain; charset=UTF-8;
> lang=sr-Latn-SP-2003-x-href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
> 
> This may break the compatibility of some parsers if such "extended
> language tags" are found there, as there are two "=" signs within the
> value of the "lang=" option.
> 
> For GET URLs, these extra "%" and "=" characters will need to be
> URL-encoded to get through correctly, as the following would become
> possible and prone to generate form-data parsing errors:
> 
> http://www.anysite.domain/process-form.cgi?lang=sr-Latn-SP-2003-x-
> href=http%3A%2F%2Fwww%2Eiana%2Eorg%2F-version=1%2E0
> 
> I think it's quite strange that these extensions have not used the
> existing restricted character set to encode them, instead of relying
> on "%" and "=". Why not use "_" instead of "=" and "." instead of "%",
> like this:
> "sr-Latn-SP-2003-x-href_http.3A.2F.2Fwww.2Eiana.2Eorg.2F-version_1.2E0"
> (same meaning as the first example above).
> 
> But at least this draft offers a good starting point for indicating
> locales more precisely.

> I fully support the new reference to the ISO 15924 standard for the
> script code, and the documenting of the legal values of variant codes
> (either a year with possible era, or a registered tag), as well as the
> clear indication that language codes should be the shortest ISO 639
> codes (is that true for the few legacy languages which were previously
> coded with 3 letters and upgraded to 2-letter codes, until there was a
> policy not to do so anymore?)
> 
> Where this affects Unicode, I don't know, except in its possible
> normative data tables, which may contain future language code
> conditions, or in language tags inserted in Unicode-encoded texts.
> Does Unicode need these extensions?



RE: Proposed Successor to RFC 3066 (language tags)

2003-11-19 Thread Addison Phillips [wM]


All:

Please note that there is a discussion list for this topic at: [EMAIL PROTECTED]

While Mark and I welcome your comments here or privately, off-list, you can 
best be a part of the discussion by joining that list. Join the list by 
sending a request email to: [EMAIL PROTECTED]
 
Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Mark
> Davis
> Sent: Tuesday, November 18, 2003 10:56 PM
> To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Proposed Successor to RFC 3066 (language tags)
> 
> Addison and I have been working on a proposed successor to RFC 3066
> (language tags), which should be of interest to many people on this
> list.
> 
> http://www.ietf.org/internet-drafts/draft-langtags-phillips-davis-01.txt
> 
> Feedback is welcome.
> 
> Mark
> 
> Note: we submitted a PDF version at the same time, which is *way*
> easier to read than the normal RFC format. We're hoping that it shows
> up eventually!
> __
> http://www.macchiato.com


RE: Ewellic

2003-11-13 Thread Addison Phillips [wM]
At the risk of prolonging a pointless thread:

A cipher is a transfer encoding scheme. So are barcodes, semaphore flags,
Morse code, and Base64. The point of each of these is to transfer some piece
of plaintext. Unicode is, among other things, a character encoding scheme,
not a transfer encoding scheme. Unicode characters can be encoded in a TES
and not the other way around.
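
To make the direction concrete, a minimal sketch (an illustration of mine,
assuming a modern JVM with java.util.Base64; the sample string is arbitrary)
of a CES step followed by a TES step:

import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class TesVsCes {
    public static void main(String[] args) {
        // CES step: characters become bytes (UTF-8 is a character encoding scheme)
        byte[] utf8 = "Unicode".getBytes(StandardCharsets.UTF_8);
        // TES step: bytes become transfer-safe ASCII (Base64 is a transfer encoding scheme)
        String wire = Base64.getEncoder().encodeToString(utf8);
        System.out.println(wire); // VW5pY29kZQ==
    }
}

The TES wraps the output of the CES; the reverse makes no sense, which is the
asymmetry described above.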

Unicode might encode scripts intended solely as a transliteration. I think
it unwise to make broad statements about what the UTC will or won't do until
it does it, because human languages are such a complex topic. To the extent
that such "rules" are interesting, I suppose they should be discussed, but
much of this thread has dwelt in the area of whether or not there should be
hard and fast rules for or against the UTC doing certain kinds of things in
the abstract. This is pointless and solipsistic.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Behalf Of Jim Allan
> Sent: Thursday, November 13, 2003 8:32 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Ewellic
>
>
> Ken Whistler posted:
>
> > *Ciphers* are orthographies designed (ideally) to map one-to-one
> > against graphemes of a writing system and (ideally) are designed
> > to obscure those graphemes by using non-obvious forms to hide
> > content from casual observers.
>
> I don't think whether a system was "designed to obscure the graphemes"
> is important (at least in respect to whether Unicode should encode a
> script or not). Some discussing this seemed to think it was.
>
> For example Morse code, semaphore flags, braille, and bar codes are
> often implemented in fonts as one-to-one transliterations of the
> corresponding Latin characters. But these systems were not at all
> designed to obscure the graphemes to which they point, but to reveal
> their semantics clearly in situations where normal representations of
> the original graphemes were not as usable.
>
> Perhaps rather than "cipher" one should say that Unicode does not encode
> separately scripts or systems intended solely as transliterations of
> other scripts. Ciphers are a common example of such scripts and systems.
>
> Jim Allan
>
>
>
>
>
>
>
>




RE: Ill-formed sequences (was: Re: UTF-16 inside UTF-8)

2003-11-05 Thread Addison Phillips [wM]
> 
> I assume that by "multiple UTF-8 sequences that could represent the same
> logical text," Adobe is referring to non-shortest UTF-8 sequences such
> as  and not to Unicode canonical equivalences or something else.
> No similar warning about "multiple sequences" is given in the sections
> that deal with UTF-16.
> 

I am under the impression that they mean combining sequences.

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture. 
It is not a feature.




RE: UTF-9

2003-10-30 Thread Addison Phillips [wM]
Pub date = 1 April 2003.

I think that's the salient part.

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility
http://www.webMethods.com
Chair, W3C Internationalization (I18N) Working Group
Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Behalf Of Philippe Verdy
> Sent: jeudi 30 octobre 2003 15:42
> To: John Cowan
> Cc: [EMAIL PROTECTED]
> Subject: Re: UTF-9
>
>
> From: "John Cowan" <[EMAIL PROTECTED]>
>
> > http://panda.com/tops-20/utf9.txt
> >
> > Res ipsa loquitur.
>
> Are there still now platforms where storage bytes are not octets
> but nonets?
> i.e. 9-bit based platforms? If so this proposal makes sense, but
> as a local
> optimization for these platforms. Problems will come if you want to
> interchange this data with the large majority of hosts that cannot
> handle a 9th bit in their bytes.
>
> This means that the interchange would require sending 2 octets to represent
> each 9-bit byte without losing data, or using a complex bit pattern to
> pack sequences of eight 9-bit bytes into sequences of nine 8-bit bytes, and
> with a way to interpret the last sequence (such converters needed for
> interoperability do exist: look for example at the MIME Base64 algorithm,
> which is used to pack sequences of 8-bit bytes into serialized
> octets each with 6 significant bits).
>
> UTF-9 seems interesting in this case, but is it worth it,
> as it is not
> directly interchangeable with the most common networking
> technologies? Can't
> you accept to lose 1 bit per storage byte?
>
> What will happen then to a plain-text file coded with UTF-9 that is sent
> through FTP? Do you mean that FTP should use a Base256 converter for 9-bit
> platforms similar to Base64 for 8-bit platforms, to avoid losing the most
> significant bits of each transferred byte? How is the recipient of the file
> supposed to interpret the converted data? Is it still plain text?
>
> So if the format is not interchangeable, this UTF-9 transform looks like a
> local-only transformation, and locally, each host can use its own
> representation. And why not then a UTF-18 encoding scheme that would avoid
> using UTF-16 surrogates for all characters that fit in the first 4 planes?
>
> For me, a UTF-18 encoding would make better sense if local optimization in
> memory is the issue, as it will represent almost all existing Unicode
> characters in planes 0 (BMP), 1 (SMP), 2 (SIP) and 3 (still not used, but
> you may map instead the SSP plane 14 for tags and variation selectors, or
> keep it for later use as SIP2) in one 18-bit code unit... But you'll still
> need a converter to transform it to UTF-8 or a UTF-16 encoding scheme to
> perform any I/O.
>
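
A minimal sketch of the bit packing Philippe describes (an illustration under
stated assumptions: the method name is invented, and it packs exactly eight
9-bit units into nine octets):

public class NonetPack {
    // Pack eight 9-bit code units (values 0..511) into nine octets: 72 bits.
    static byte[] pack(int[] nonets) {
        byte[] out = new byte[9];
        long bits = 0;       // bit accumulator
        int held = 0, o = 0; // bits currently held; output offset
        for (int i = 0; i < 8; i++) {
            bits = (bits << 9) | (nonets[i] & 0x1FF); // append 9 bits
            held += 9;
            while (held >= 8) {                       // flush whole octets
                held -= 8;
                out[o++] = (byte) (bits >>> held);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = pack(new int[] { 0x1FF, 0, 0x100, 0xFF, 1, 2, 3, 4 });
        System.out.println(packed.length); // 9
    }
}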




RE: Beyond 17 planes, was: Java char and Unicode 3.0+

2003-10-16 Thread Addison Phillips [wM]
Maybe I should have said it the other way first: yuck. I don't want any
characters beyond the planes ever. Nor a change in encoding rules.

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office)  +1 408.210.3569 (mobile)
mailto:[EMAIL PROTECTED]

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: Philippe Verdy [mailto:[EMAIL PROTECTED]
> Sent: Thursday, October 16, 2003 4:21 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Beyond 17 planes, was: Java char and Unicode 3.0+
>
>
> From: "Addison Phillips [wM]" <[EMAIL PROTECTED]>
>
> > Here's a proposed solution then. I hereby submit it for use on that
> > incredibly distant day in which our oracle fails and a new 1 million code
> > point script is added to Unicode (i.e. never).
> >
> > When all of the planes less than 16 are full and the possibility of
> > exhausting code points becomes actually apparent (but not before), the UTC
> > should reserve a range of code points in plane 16 to serve as "astral low
> > surrogates" and another to serve as "astral high surrogates". UTF-16 can
> > then use a pair of surrogate pairs to address the higher planes thereby
> > exposed. And we won't all have to muck with our implementations to support
> > this stuff.
>
> Too late for plane 16: it's currently assigned to PUAs...
> Same thing for plane 15.
>
> But such extension space is certainly available in the special
> (spacial? astral? ;-)) plane 14 ...
>
> Which could then be reserved for "hyper-surrogates", referencing
> codepoints out of the first 17 planes, and assigned in an open
> registry for interchangeable semi-private uses, such as corporate
> logographs and other visual trademarks (including the famous
> Apple logo character in the MacRoman encoding, or the extra
> PUAs needed by Microsoft in its OpenType fonts for Office...)




RE: Beyond 17 planes, was: Java char and Unicode 3.0+

2003-10-16 Thread Addison Phillips [wM]
Here's a proposed solution then. I hereby submit it for use on that
incredibly distant day in which our oracle fails and a new 1 million code
point script is added to Unicode (i.e. never).

When all of the planes less than 16 are full and the possibility of
exhausting code points becomes actually apparent (but not before), the UTC
should reserve a range of code points in plane 16 to serve as "astral low
surrogates" and another to serve as "astral high surrogates". UTF-16 can then
use a pair of surrogate pairs to address the higher planes thereby exposed.
And we won't all have to muck with our implementations to support this
stuff.
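
For reference, standard UTF-16 already recovers a supplementary code point
from a single surrogate pair with this arithmetic (a sketch of the existing
algorithm, not of the proposed "astral" extension):

public class SurrogateMath {
    public static void main(String[] args) {
        char hi = 0xD800, lo = 0xDC00; // the lowest surrogate pair
        int cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
        System.out.println(Integer.toHexString(cp)); // 10000, i.e. U+10000
    }
}

The proposal above would simply run the same trick one level up, with a pair
of such pairs.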

Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office)  +1 408.210.3569 (mobile)
mailto:[EMAIL PROTECTED]

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Behalf Of Rick McGowan
> Sent: Thursday, October 16, 2003 12:50 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Beyond 17 planes, was: Java char and Unicode 3.0+
>
>
> Before everyone goes jumping off the deep end with wanting to
> reserve more
> space on the BMP for hyper extended surrogates or whatever, can someone
> please come up with more than 1 million things that need to be encoded?
>
> Our best estimate, for all of human history, comes in around
> 250,000. Even
> if we included, as characters, lots of stuff that is easily unified with
> existing characters, or undeciphered, or just more dingbatty blorts, it
> comes up nowhere near a million.
>
> What you see on the roadmap is what we, in over 12 years of searching,
> have been able to find. I challenge anyone to come up with enough
> legitimate characters (approximately a million of them) that
> aren't on the
> roadmap to fill the 17 planes.
>
> Thanks.
>   Rick




DEC-MCS mapping, anyone?

2003-10-10 Thread Addison Phillips [wM]
I'm looking for an authoritative mapping table of this old VMS encoding to Unicode and 
can't find one anywhere. Anyone got one handy?

Thanks in advance!

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office)  +1 408.210.3569 (mobile)
mailto:[EMAIL PROTECTED]

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws 

Internationalization is an architecture. 
It is not a feature.




Re: Euro Currency for UK

2003-10-09 Thread Addison Phillips [wM]
Hmm.. this isn't really a Unicode question. You might want to post this 
question over on the i18n programming list '[EMAIL PROTECTED]' 
or on the locales list at '[EMAIL PROTECTED]'.

You don't say what your programming or operating environments are. There 
are two possibilities here.

If you want to use your existing software to display currencies as the 
Euro instead of pounds, you can generally either set the display 
settings (Windows Regional options control panel) for currency to "look 
like" the Euro. Or you can set (on Unix systems) the LC_MONETARY locale 
variable to some locale that uses the Euro with English-like formatting. 
A few systems actually provide a specialized variant locale for 
[EMAIL PROTECTED] for this purpose. A few provide an [EMAIL PROTECTED], which won't be 
helpful to you because of differences in the separators used in the two 
locales.

You can also compile your own locale tables on Unix. Read the man pages 
on locale.

If you are writing your own software, then it really isn't that hard. 
Some programming environments, such as Java, provide a separate 
Currency class with the ability to create specific display-time formats 
that take both the currency and the display locale into account. Others 
require you to create a formatter to convert the value into a string for 
display.

In fact, when working with currency it is important to associate which 
currency you mean with the value. You may experience problems if you 
create a data field for "value" and format it according to the machine's 
runtime locale. The runtime locale can imply a certain default currency, 
as you note, but "default" does not mean "only". Consider:

123.45

Not right:

en_GB: £123.45
en_US: $123.45
de_DE: €123,45
ja_JP: ¥123

Most commonly the ISO 4217 currency code is associated with a value to 
create a data structure that is specific:

  value:    123.45
  currency: EUR

en_GB: €123.45
en_US: €123.45
de_DE: €123,45
ja_JP: €123.45

Getting the formatting right is a matter of accessing the formatting 
functions of your programming API correctly. Most programming 
environments provide a way to format a value using separate locale rules 
(for grouping and decimal separators) and currency.
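
In Java, for example, a minimal sketch (assuming JDK 1.4's java.util.Currency;
the class name is invented):

import java.text.NumberFormat;
import java.util.Currency;
import java.util.Locale;

public class EuroFormat {
    public static void main(String[] args) {
        double value = 123.45;
        Currency eur = Currency.getInstance("EUR"); // the ISO 4217 code travels with the value
        Locale[] viewers = { Locale.UK, Locale.GERMANY, Locale.JAPAN };
        for (int i = 0; i < viewers.length; i++) {
            NumberFormat nf = NumberFormat.getCurrencyInstance(viewers[i]);
            nf.setCurrency(eur); // a EUR amount, formatted with the viewer's separators
            System.out.println(viewers[i] + ": " + nf.format(value));
        }
    }
}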

More information about what you're trying to do would help in 
recommending a solution.

Best Regards,

Addison

--
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
+1 408.962.5487  mailto:[EMAIL PROTECTED]
---
Internationalization is an architecture. It is not a feature.
Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws




RE: Web Form: Other Question: Java Tool

2003-09-19 Thread Addison Phillips [wM]
Hi Anne,

This is a common problem, which has a solution on two different levels. The
problem you have would be more succinctly phrased as "Java knows what the
characters are, but doesn't have a picture of each character to show you.
You need to supply the 'picture'."

You have the right solution, which involves the font.properties file.

First there is the problem of what you do when developing the product (so
you can see that it is working correctly). If you are working on Windows and
own a copy of MS-Office, the easiest thing to do is install the font "Arial
Unicode MS" and modify your font.properties file appropriately (there are
other such fonts in the world... I'm sure you can find them if you look, but
this one is common to have). From that point on you'll be able to see the
characters in your JTable.

Second, there is what you do for your end users. You'll have to provide the end
users with instructions about how to identify appropriate fonts and install
them into their font.properties files.

Here's how you modify your font.properties. For each logical font "block",
you can add the additional font(s) like this:

dialog.0=Arial,ANSI_CHARSET
dialog.1=WingDings,SYMBOL_CHARSET
dialog.2=Symbol,SYMBOL_CHARSET
dialog.3=MS PMincho,SJIS_CHARSET # example of Japanese font
dialog.4=Arial Unicode MS   # Note the spaces!


Then you need to add a reference to the font's disk name, like this:

# Font File Names
#
filename.Arial=ARIAL.TTF
filename.Arial_Bold=ARIALBD.TTF
filename.Arial_Italic=ARIALI.TTF
filename.Arial_Bold_Italic=ARIALBI.TTF
filename.Arial_Unicode_MS=ARIALUNI.TTF   # Note the underscores here!
filename.MS_PMincho=MSMINCHO.TTF

For some fonts (but not Arial Unicode) you must modify the font converters
and exclusion ranges:

# Component Font Character Encodings
#
fontcharset.dialog.0=sun.io.CharToByteCp1252
fontcharset.dialog.1=sun.awt.windows.CharToByteWingDings
fontcharset.dialog.2=sun.awt.CharToByteSymbol
fontcharset.dialog.3=sun.io.CharToByteCp932  #Mincho is a Shift-JIS font and
requires conversion. 932 is the Japanese code page.

# Exclusion Ranges
#
exclusion.dialog.0=0500-20ab,20ad-
exclusion.dialoginput.0=0500-20ab,20ad-
exclusion.serif.0=0500-20ab,20ad-
exclusion.sansserif.0=0500-20ab,20ad-
exclusion.monospaced.0=0500-20ab,20ad-

When you are done, save the file as "font.properties" in your jre/lib
directory. Now your program should work the way you expect.

A few more tips:

1. Be sure your code uses logical fonts (that is,
dialog/serif/sansserif/monospaced/etc., not
Times/Arial/Helvetica/Verdana/etc.); see the sketch after these tips.

2. Be careful of bold and italic. Many Asian fonts do not contain a full set
of glyphs for these styles. You'll have to substitute the plain fonts in
order to see the characters, but they won't be bold or italic or
what-have-you.
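
For tip 1, a minimal sketch (the JTable here stands in for any Swing
component):

import java.awt.Font;
import javax.swing.JTable;

public class LogicalFontDemo {
    public static void main(String[] args) {
        JTable table = new JTable();
        // "Dialog" is a logical font, resolved through font.properties;
        // naming "Arial" here would bypass the mappings shown above.
        table.setFont(new Font("Dialog", Font.PLAIN, 12));
    }
}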

Hope this helps.

Best Regards,

Addison


Addison P. Phillips
Director, Globalization Architecture
webMethods | Delivering Global Business Visibility

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office)  +1 408.210.3569 (mobile)
mailto:[EMAIL PROTECTED]

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Behalf Of Magda Danish (Unicode)
> Sent: Friday, September 19, 2003 2:13 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: FW: Web Form: Other Question: Java Tool
>
>
> Anne,
>
> I am forwarding your email to the Unicode Public Email list
> http://www.unicode.org/consortium/distlist.html.
> I hope someone will be able to answer your question.
>
> Regards,
>
> Magda Danish
> Administrative Director
> The Unicode Consortium
> 650-693-3921
>
> > -Original Message-
> > Date/Time:Fri Sep 19 05:38:35 EDT 2003
> > Contact:  [EMAIL PROTECTED]
> > Report Type:  Other Question, Problem, or Feedback
> >
> > Dear Unicode-Team,
> >
> > my name is Anne Gleitsmann. My task is to implement a tool in
> > java that administers different Ressource-Bundles. In my tool
> > you can choose a master-document and one or more
> > slave-documents, then follows the data-comparision. The
> > document is being displayed in JTables.
> > Now the language-variety has been expanded to include
> > Japanese and Korean - and that is where my problem is: the
> > font of these languages is shown as little squares.
> >
> > I found the information in the internet that I need to add
> > the range of those characters to the font.properties file -
> > but how do I do this? Could you help me or give me advise as
> > to how to do this?
> >
> > Thank you!
> >
> > Greetings from Germany!
> >
> > Sincerely
> > Anne Gleitsmann
> >
> > -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> > (End of Report)
> >
> >




GB18030 mapping table....

2003-08-19 Thread Addison Phillips [wM]
Hi Will,

The ICU library is a good source for information like this. See:

http://oss.software.ibm.com/icu/charset/

The data table is located here:

http://oss.software.ibm.com/cvs/icu/charset/data/xml/gb-18030-2000.xml

Read the note on the first page.

There are official sources as well, but I don't have them bookmarked on this
machine.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc. -- "Global Business Visibility"

432 Lakeside Drive, Sunnyvale, CA, USA
+1 408.962.5487 (office)
+1 408.210.3569 (mobile)
mailto:[EMAIL PROTECTED]

Chair, W3C-I18N-WG, Web Services Task Force
http://www.w3.org/International/ws

Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Behalf Of Magda Danish (Unicode)
> Sent: Tuesday, August 19, 2003 10:32 AM
> To: [EMAIL PROTECTED]
> Subject: FW: Web Form: Other Question: Mapping table
>
>
>
> > -Original Message-
> > Date/Time:Tue Aug 19 13:20:07 EDT 2003
> > Contact:  [EMAIL PROTECTED]
> > Report Type:  Other Question, Problem, or Feedback
> >
> > Dear Sirs,
> >
> > Is there a mapping table from Unicode 4.0 to the GB
> > 18030-2000 standard available for download?
> >
> > Thanks in advance,
> > Will McKee.
> >
> >
> >
> > -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> > (End of Report)
> >
> >




Re: Display and Store Problem!

2003-08-08 Thread Addison Phillips [wM]
Hi Rajesh,

Unless you are using nchar/nvarchar fields for your data, SQL Server is 
using a specific code page (non-Unicode character set) to store your 
data. It is likely that this is Cp1252 (Western European).

If you switch to nchar/nvarchar fields your mapping issues will go away. 
I have a paper that describes some of these issues on the webMethods 
website (it's "Creating Multi-lingual and Multi-locale Databases") here:

http://www.webmethods.com/cmb/solutions/solutions_white_paper_email_entry/
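
A minimal JDBC sketch of the nchar/nvarchar approach (the table and column
names are hypothetical):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class UnicodeInsert {
    // Hypothetical table: CREATE TABLE labels (id INT, label NVARCHAR(100))
    static void insertLabel(Connection conn, int id, String label)
            throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO labels (id, label) VALUES (?, ?)");
        try {
            ps.setInt(1, id);
            ps.setString(2, label); // Java's UTF-16 chars land in NVARCHAR with no code page hop
            ps.executeUpdate();
        } finally {
            ps.close();
        }
    }
}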

Best Regards,

Addison

--
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
+1 408.962.5487  mailto:[EMAIL PROTECTED]
---
Internationalization is an architecture. It is not a feature.
Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws
[EMAIL PROTECTED] wrote:
Dear professionals,
 
kindly help me.
 
My application is developed by using the JDK1.4 (Swing) on windows 2000 
platform with Arial Unicode MS font. RDBMS used for data stores is MS 
SQL 2000.
 
My problem is when I insert data in hindi language through the 
application into RDBMS, it stores in some different characters, but when 
I retrieve data by putting any query, it retrieves on screen of the 
application appropriately. But, when I insert data in RDBMS directly 
(without using application) it stores in appropriate format, but when 
the same data wants to display by retrieving through the application, it 
displays in different characters.
 
Can anyone clarify me the problem?
 
Thanking you.
 
===
Rajesh Chandrakar, STO - I,
INFLIBNET Centre, UGC,
An Inter-University Centre of UGC,
P.B. No. 4116, Navrangpura,
Ahmedabad, Gujarat (India) 380009
Tel. No.: +91 - (0)79 - 6305971, 6304695, 6308528 (O), 6873805 (R)
Fax: +91 - (0) 79 - 6300990, 6307816
Web: http://www.inflibnet.ac.in
===   






Re: ISO 639 "duplicate" codes (was: Re: Ligatures in Turkish and Azeri, was: Accented ij ligatures)

2003-07-15 Thread Addison Phillips [wM]
Philippe wrote:

>>I have tried several times to do it. It does not work: you may
>>effectively remove some tables you don't need, but trying
>>to extract just the normalizer is a real nightmare. I tried it
>>in the past, and abandoned it: too tricky to maintain, and I
>>retried it recently (one month ago, from its CVS source) and
>>this was even worse than the first time.
webMethods includes the ICU normalizer in a couple of our products. The 
code for one of these products requires JDK 1.2.2, so, since I had to 
compile ICU anyway, I took the time to figure out the dependencies and 
build only what I needed.

The list of classes required for the normalizer is actually quite small. 
Of the 1.3MB ICU4j.jar, only 400K are required for the normalizer to 
operate correctly. Source changes are not required. I will gladly send a 
complete list of classes to anyone who would like it. It took me a day 
to do the work (it took longer to test it than to build it).

Adding the normalizer to the JDK itself would also not be a difficult 
thing for Sun to do: that's because a version of the normalizer is 
already in the JDK, but private.

I will admit that it used to be quite difficult, back in the ICU 1.x 
days, to separate out the normalizer, but I've done that too (for 
reasons I shan't enumerate). I had to modify some source code to make it 
work, but that was mostly because I needed JDK 1.1.x. That JAR file is 
even smaller, at 161K. Building updated data tables is actually easier 
with the old source code...

In any event, you really ought to try the newer versions of ICU4J out. 
They are a lot easier to work with. And a "light" version isn't that 
hard to create, if that's what you want.
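
Once the classes are in place, using it is a one-liner (a sketch, assuming
ICU4J's com.ibm.icu.text.Normalizer API):

import com.ibm.icu.text.Normalizer;

public class NormalizeDemo {
    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' plus U+0301 COMBINING ACUTE ACCENT
        String composed = Normalizer.normalize(decomposed, Normalizer.NFC);
        System.out.println(composed.equals("\u00E9")); // true: U+00E9, e with acute
    }
}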

Best Regards,

Addison

--
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
+1 408.962.5487  mailto:[EMAIL PROTECTED]
---
Internationalization is an architecture. It is not a feature.
Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws




[W3C-I18N] Working Draft, Web Services Usage Scenarios, Published

2002-12-23 Thread Addison Phillips [wM]
Dear Subscribers,

The Web Services Task Force of the W3C Internationalization Working Group is pleased 
to announce the publication of the first "Web Services Internationalization Usage 
Scenarios" Working Draft at http://www.w3.org/TR/2002/WD-ws-i18n-scenarios-20021220/.

The goal of this document is to gather scenarios for using Web services in ways that 
require attention to internationalization.

This information will be the basis for:

   o creating "best practices" and tutorial information for Web service implementers 
and technology providers

   o proposing standards for Web services and related technologies to support 
international operation

   o demonstrating international issues in existing Web services technologies and 
standards to the groups responsible for their development and maintenance

We invite you to contribute additional Usage Scenarios that document aspects of Web 
Services internationalization that are not yet covered by this document or that extend 
the examples given. To contribute, please use a format similar to the one used in this 
document, and send your contribution to the publicly archived mailing list: 
mailto:[EMAIL PROTECTED]. Please use "[Web Services]" or "[WSUS]" in the 
subject.

Please do not send proprietary material or information.

We also invite you to join this Task Force. Members of the Task Force will have the 
opportunity to influence the internationalization of Web services standards as they 
evolve. You may also wish to join our public mailing list (public-i18n-ws @ w3.org). 
Instructions on joining the Task Force and mailing list are located on our home page, 
located here: http://www.w3.org/International/ws.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature. 

Chair, W3C-I18N-WG Web Services Task Force
To participate see http://www.w3.org/International/ws 





[W3C-I18N-WG] Call for Participation, Web Services TF

2002-10-14 Thread Addison Phillips [wM]

Hello Unicadets:

Recently the W3C's Internationalization Work Group completed its rechartering. As a 
result, two new task forces were created to augment the existing "core" activity (of 
creating and reviewing internationalization issues in standards for the Web). One of 
these is the Web Services Task Force, whose focus will be on creating and influencing 
standards related to the area of Web services and the protocols, locales, and data 
structures related to these. I'm writing to you as the chair of the new task force to 
ask for your participation.

This is your chance to become involved and actually do something about globalization 
issues that affect all of us. As with any activity there are different levels of 
involvement, but it is most rewarding if we have many participants at the highest 
level of involvement.

As a member of the Unicode community, you are already interested in globalization 
issues. This task force has the opportunity, for good or ill, to influence how the 
next generation of Internet technologies implement languages, encodings, locales, and 
data structures to support users worldwide. It will have the opportunity to set "the 
rules" for things such as language tagging, locale interchange, or preference 
indication on the Web.

Standards organizations like W3C can provide the opportunity to do something about 
problems that you encounter every day. However, a task force is not some mythic body 
composed of anointed experts. This task force needs you to consider contributing time 
and effort in order to be successful. I urge each of you to review the different 
levels of activity listed below and consider participating.

For more information, see http://www.w3.org/International/ws.

1. WG Membership. If you work for a W3C member organization (see [1] below for a list) 
it is quite easy to become a member of the Working Group. This does represent a 
commitment on your part (to participate in teleconferences, group activities, and 
meetings, as well as to contribute to the overall effort of the group), but carries 
with it the greatest involvement and reward. 

As with anything, the greatest level of commitment brings with it the greatest reward. 
You will have full access to and full participation in all of the WS related 
activities of this task force. Not to mention the recognition of your peers and 
colleagues.

2. IG Membership. If you feel unable to make a commitment to the task force, but are 
interested in our activities and you work for a W3C member organization, you can still 
participate on a relatively high level by joining the Interest Group. IG members will 
have access to and the opportunity to participate in some of the activities of the 
main task force without the commitment of any particular amount of time or effort. I 
strongly urge you to consider WG membership if you have at least some time to commit 
to this effort, but IG members will have an opportunity to contribute too.

3. Public Membership. Anyone can join our public mailing list ([EMAIL PROTECTED]) 
where the majority of technical discussion will take place. In addition, we will be 
asking for public contribution of use cases and other materials to support our task 
force's activities. While you won't have access to or a direct voice in the decisions 
and direction of the task force, your voice will be heard and you will have a chance 
to help make the TF a success.

We will consider a few applications to be "invited experts" (that is, people who do 
not work for W3 member organizations) for WG membership, provided you can commit the 
time and effort. This kind of membership is limited because of the way that the W3C 
works. If you are interested in this type of involvement, I do urge you to contact us.

To find out more about this opportunity and details on how to join at any of these 
levels of involvement, you should check our public page here: 
http://www.w3.org/International/ws

Best Regards,

Addison

[1] http://www.w3.org/Consortium/Member/List

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.

+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature. 

Chair, W3C-I18N-WG Web Services Task Force
To participate see http://www.w3.org/International/ws 






RE: UCS-2 and UTF-16

2002-09-12 Thread Addison Phillips [wM]

Yes it is. I use it in reference to Java because Java's
surrogate/supplemental character support is quite limited. It is more
accurate to describe it as UCS-2 support. This isn't to say that valid
UTF-16 sequences are mangled in any way. Just that Java doesn't know what
they are, really.

In this context, UCS-2 vs. UTF-16 is unimportant.

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: Stefan Persson [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, September 12, 2002 2:45 PM
> To: Addison Phillips [wM]; Philippe de Rochambeau
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: UCS-2 and UTF-16
>
>
> - Original Message -
> From: "Philippe de Rochambeau" <[EMAIL PROTECTED]>
> To: "Addison Phillips [wM]" <[EMAIL PROTECTED]>
> Cc: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Thursday, September 12, 2002 10:08 PM
> Subject: Re: Problems converting from UTF-8 to UCS-2 and vice-versa using
> JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
>
> > > String ucs2 = new String(byt, "UTF-8");  // turn them into a real
> > > UCS-2 string
> >
> > Isn't UCS-2, UTF-16?
>
> Isn't UCS-2 the same as UTF-16 without surrogate support?
>
> Stefan
>
> _
> Gratis e-mail resten av livet på www.yahoo.se/mail
> Busenkelt!
>
>





RE: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1

2002-09-12 Thread Addison Phillips [wM]

Hi Phillippe,

UTF-16 is (kind of) UCS-2...

What's your system code page? System.out.println uses your system code page
to display characters--it does an implicit conversion. To check your code,
try this:

char[] c = myUCSString.toCharArray();
for (int x = 0; x < c.length; x++) {
   System.out.println(Integer.toHexString(c[x])); // dump each UTF-16 code unit
}

For more background, see http://www.inter-locale.com/IUC19.pdf

Hope that helps.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Philippe de Rochambeau
> Sent: Thursday, September 12, 2002 1:09 PM
> To: Addison Phillips [wM]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: Problems converting from UTF-8 to UCS-2 and vice-versa
> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
>
>
> Hello,
>
> > String ucs2 = new String(byt, "UTF-8");  // turn them into a real
> > UCS-2 string
>
> Isn't UCS-2, UTF-16?
>
> > byte[] byt = myString.getBytes("ISO8859_1");  // get the original
> > UTF-8 bytes back
> > String ucs2 = new String(byt, "UTF-8");  // turn them into a real
> > UCS-2 string
>
> If I do the above, I get the question marks back, whether I display
> the data this way
>
> out.println( rfLibelle );
>
> or that way
>
> out.println( new String( rfLibelle.getBytes(), "UTF-8" ) );
>
> I think that there is something wrong with either JRun 3.1, Windows 2000 or
> SQL Server 2000 (or a combination of them).
>
> I don't have any problems with Tomcat 4 + PostgreSQL on MacOSX.
>
> Best regards,
>
> Philippe de Rochambeau
>
> Le jeudi, 12 sep 2002, à 18:33 Europe/Paris, Addison Phillips [wM] a
> écrit :
>
> > For some reason I don't the see the original email, so I'm going to
> > guess based on Marco's response below.
> >
> > The code below is nearly correct, assuming that the starting point was
> > that each UTF-8 byte was converted into a single java.lang.Character
> > object in the String. That is, if the String contained the sequence
> > U+00E8 U+00AA U+009E..., the code would be:
> >
> > byte[] byt = myString.getBytes("ISO8859_1");  // get the original
> > UTF-8 bytes back
> > String ucs2 = new String(byt, "UTF-8");  // turn them into a real
> > UCS-2 string
> >
> > It is very important to name the encoding in the string constructor,
> > otherwise the String constructor assumes the JVM's file.encoding --
> > most of the time.
> >
> > There is an annoying bug/feature in some JVMs on real Asian Windows
> > (including 2K and XP) in which the file.encoding is ignored in favor
> > of the actual System Active code page (SYS_ACP) and setting the
> > -Dfile.encoding="someEncoding" doesn't work to change the String
> > constructor's default behavior. You have to be careful to always name the
> > encoding, not just rely on the system to provide it.
> >
> > If your original byte[] is in a real CJK encoding, then you need to
> > name that encoding instead of UTF-8 above (and you can do that by
> > getting the file.encoding system parameter if you are running on the
> > same platform), like so:
> >
> > byte[] byt = myString.getBytes("ISO8859_1");
> > String ucs2 = new String(byt, System.getProperty("file.encoding"));
> >
> > If the original byte[] is actually correctly formed and you want to
> > get UTF-8, Marco's code is correct:
> >
> > byte[] utf8bytes = myString.getBytes("UTF-8");
> >
> > Note that I have omitted try/catch blocks for clarity, but the
> > compiler will insist on them...
> >
> > Hope that helps.
> >
> > Best Regards,
> >
> > Addison
> >
> > Addison P. Phillips
> > Director, Globalization Architecture
> > webMethods, Inc.
> > 432 Lakeside Drive
> > Sunnyvale, California, USA
> > +1 408.962.5487 (phone)
> > +1 408.210.3569 (mobile)
> > -
> > Internationalization is an architecture.
> > It is not a feature.
> >
> >> -Original Message-
> >> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> >> Behalf Of Marco Cimarosti
> >> Sent: Thursday, September 12, 2002 4:51 AM
> >> To: '[EMAIL PROTECTED]'; [EMAIL PROTECTED]
> >> Subject: RE: Problems converting from UTF-8 to UCS-2 and

RE: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1

2002-09-12 Thread Addison Phillips [wM]

For some reason I don't the see the original email, so I'm going to guess based on 
Marco's response below.

The code below is nearly correct, assuming that the starting point was that each UTF-8 
byte was converted into a single java.lang.Character object in the String. That is, if 
the String contained the sequence U+00E8 U+00AA U+009E..., the code would be:

byte[] byt = myString.getBytes("ISO8859_1");  // get the original UTF-8 bytes back
String ucs2 = new String(byt, "UTF-8");  // turn them into a real UCS-2 string

It is very important to name the encoding in the string constructor, otherwise the 
String constructor assumes the JVM's file.encoding -- most of the time.

There is an annoying bug/feature in some JVMs on real Asian Windows (including 2K and 
XP) in which the file.encoding is ignored in favor of the actual System Active code 
page (SYS_ACP) and setting the -Dfile.encoding="someEncoding" doesn't work to change 
the String constructor's default behavior. You have to be careful to always name the 
encoding, not just rely on the system to provide it.

If your original byte[] is in a real CJK encoding, then you need to name that encoding 
instead of UTF-8 above (and you can do that by getting the file.encoding system 
parameter if you are running on the same platform), like so:

byte[] byt = myString.getBytes("ISO8859_1");
String ucs2 = new String(byt, System.getProperty("file.encoding"));

If the original byte[] is actually correctly formed and you want to get UTF-8, Marco's 
code is correct:

byte[] utf8bytes = myString.getBytes("UTF-8");

Note that I have omitted try/catch blocks for clarity, but the compiler will insist on 
them...

Hope that helps.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)  
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature. 

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Marco Cimarosti
> Sent: Thursday, September 12, 2002 4:51 AM
> To: '[EMAIL PROTECTED]'; [EMAIL PROTECTED]
> Subject: RE: Problems converting from UTF-8 to UCS-2 and vice-versa
> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
> 
> 
> Philippe de Rochambeau wrote:
> > On the other hand, if I store the previous "go" character 
> > plus an unusual 
> > CJK ideogram whose Unicode equivalent is \u5439 (E5 90 B9 in UTF-8) 
> > in the DB and retrieve the data, JRun 3.1 will only display the first 
> > character in my form's textarea, plus a few invisible 
> > characters, and the 
> > database will contain the following hex values:
> > 
> > E8 AA 9E E5 3F B9 20 20 20 20 20 20 0D 0A 0A
> > 
> > As you can see, "go" is still there, but the following 
> > character (E5 3F B9) 
> > is not \u5439 (E5 90 B9). I cannot figure out how to fix this problem.
> > 
> > Any help with this problem would be much appreciated.
> 
> I see what the problem is. As usual, it's all the fault of Bill Gate$. :-)
> 
> If you interpret E5 90 B9 according to Windows-1252, you see 
> that E5 is
> "å", B9 is "¹", but 90 is an unassigned slot! Unassigned characters are
> normally turned into question marks, and "?"'s code is (guess 
> what) 3F...
> 
> With E8 AA 9E this works only by chance, because all three bytes are valid
> Windows-1252 characters: "è", "ª", and "ž", respectively.
> 
> I guess that the problem starts when you try to fool the system into
> thinking that the text is ISO 8859-1:
> 
>   byte[] byt = (newQfLibelleArray[i]).getBytes( "ISO8859_1" );
>   String tempUtf16 = new String( byt );
> 
> But, sorry. I can't help with a fix, because I don't know Java API's well
> enough.
> 
> Can't you do something like <.getBytes("UTF-8")>? Or, even better, doesn't
> (newQfLibelleArray[i]) have a method to return a String object directly?
> 
> _ Marco
> 
> 
> 
> 





RE: Problems converting from UTF-8 to UCS-2 and vice-versa using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1

2002-09-12 Thread Addison Phillips [wM]

How about:

  String myString = null;
  if (newQfLibelleArray[i] instanceof String) {
     myString = (String) newQfLibelleArray[i];  // cast needed if the array is Object[]
  } else { // every class has a toString method, the result of
           // which may not be very useful...
     myString = newQfLibelleArray[i].toString();
  }

Best Regards,

Addison

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Marco Cimarosti
> Sent: Thursday, September 12, 2002 5:12 AM
> To: '[EMAIL PROTECTED]'; '[EMAIL PROTECTED]'
> Subject: RE: Problems converting from UTF-8 to UCS-2 and vice-versa
> using JRun 3.1, SQL Server 2000, Windows 2000 and Java 3.1
> 
> 
> I (Marco Cimarosti wrote):
> > [...] doesn't (newQfLibelleArray[i]) have a method to 
> > return a String object directly?
> 
> Perhaps I have been clumsy. By "returning a String object directly" I
> meant, can't you do something like this:
> 
>   String tempUtf16 = new String( (newQfLibelleArray[i]) );
> 
> Or perhaps:
> 
>   String tempUtf16 = new String( (newQfLibelleArray[i]).getString() );
>   String tempUtf16 = new String( (newQfLibelleArray[i]).getText() );
>   String tempUtf16 = new String( (newQfLibelleArray[i]).String() );
>   String tempUtf16 = new String( (newQfLibelleArray[i]).Text() );
> 
> Or whatever the actual method's name is.
> 
> _ Marco
> 
> 





RE: FW: New version of TR29:

2002-08-20 Thread Addison Phillips [wM]

How about "I'll" or "it's".

Regards,

Addison



> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of John Cowan
> Sent: Tuesday, August 20, 2002 4:40 AM
> To: Marco Cimarosti
> Cc: '[EMAIL PROTECTED]'
> Subject: Re: FW: New version of TR29:
> 
> 
> Marco Cimarosti scripsit:
> 
> > The issue is making the error window as narrow as possible. My
> > assumption is
> > that common words such as "c'", "d'", "j'", "l'", "n'",
> > "qu'", "s'", "t'"
> > or "v'" are more common than edge cases like "prud'homme".
> 
> How about this heuristic:
> 
> Break after an apostrophe that is the second or third letter in the
> word.  Do not break after apostrophes that come later.  This neatly
> handles (I think) all the English, Italian, and Esperanto cases, and
> a good many of the French ones.
> 
> -- 
> John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  
> www.ccil.org/~cowan
> Consider the matter of Analytic Philosophy.  Dennett and Bennett 
> are well-known.
> Dennett rarely or never cites Bennett, so Bennett rarely or never 
> cites Dennett.
> There is also one Dummett.  By their works shall ye know them.  
> However, just as
> no trinities have fourth persons (Zeppo Marx notwithstanding), 
> Bummett is hardly
> known by his works.  Indeed, Bummett does not exist.  It is part 
> of the function
> of this and other e-mail messages, therefore, to do what they can 
> to create him.
> 
> 
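
The heuristic is easy to state in code, and the two contractions above show
where it misfires (a sketch; the names are invented):

public class ApostropheHeuristic {
    // The quoted rule: break after an apostrophe only when it is the
    // second or third character of the word.
    static boolean breakAfterApostrophe(int apostropheIndex) {
        return apostropheIndex == 1 || apostropheIndex == 2;
    }

    public static void main(String[] args) {
        System.out.println(breakAfterApostrophe("l'avion".indexOf('\'')));    // true: French elision, correct
        System.out.println(breakAfterApostrophe("I'll".indexOf('\'')));       // true: but "I'll" is one word
        System.out.println(breakAfterApostrophe("prud'homme".indexOf('\''))); // false: correct, no break
    }
}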




RE: NLS_LANG for russian characterset in UNIX

2002-08-14 Thread Addison Phillips [wM]



Hi Ankur,

The NLS_LANG environment variable is used for configuring the Oracle database
products. If you mean that you want to set up your copy of Oracle for Russian,
you could use:

NLS_LANG=RUSSIAN_CIS.<CHARACTERSET>

where <CHARACTERSET> == one of the following:

CL8ISO8859P5
RU8PC866
RU8BESTA
RU8PC855
CL8MACCYRILLIC
CL8MACCYRILLICS
CL8MSWIN1251
CL8KOI8R

You might also want to consider being a bit forward-looking and choosing the
Unicode character set:

NLS_LANG=RUSSIAN_CIS.UTF8  // Oracle 8x
NLS_LANG=RUSSIAN_CIS.AL32UTF8  // Oracle 9x

Note that you need to install your database instance using a matching or
compatible encoding before your local setting will work correctly. The
instance character encoding is chosen at instance creation time and (more or
less) cannot be changed after that.

If you really mean in your UNIX environment, the environment variable you mean
is "LANG" and you can use the "locale -a" function to find out what Russian
locales you have installed. You will need to set up your UNIX environment to
display any Russian characters that you retrieve from the database, so you'll
probably end up setting both (again, UTF-8 locales in UNIX work nicely with
Unicode-encoded databases).

Hope that helps.

Best Regards,

Addison

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Ankur Mahajan
> Sent: Wednesday, August 14, 2002 12:50 AM
> To: [EMAIL PROTECTED]
> Subject: NLS_LANG for russian characterset in UNIX
>
> Any clue if i want to use RUSSIAN characterset in UNIX environment, what
> should i set in .profile for NLS_LANG
>
> like for american english, it is NLS_LANG=AMERICAN_AMERICA.WE8ISO8859P1
>
> so what should be the same setting for russian charset ??
>
> Regds,
> Ankur Mahajan
> Asst. Systems Engr.
> Tata Consultancy Services
> Gurgaon
> Ph. 0124-6342944/941/542 Ext. 112/115
> 9811324429


RE: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)

2002-07-29 Thread Addison Phillips [wM]

>
> > One that occurs to me might be the Khoisan languages of Africa,
> > which I believe commonly use "!" (U+0021) for a click sound.
> > This is almost exactly the same problem you are describing for Tongva.
>
> U+01C3 LATIN LETTER RETROFLEX CLICK (General Category Lo) was
> encoded precisely for this. It is to be *distinguished* from
> U+0021 '!' EXCLAMATION MARK to avoid all of the processing problems
> which would attend having a punctuation mark as part of your letter
> orthography. A Khoisan orthography keyboard should distinguish the
> two characters (if, indeed, it makes any use at all of the exclamation
> mark per se), so that users can tell them apart and enter them
> correctly.
>
Amazing! It is there (and has been "forever", since it has a Unicode 1.0
name) and doesn't even normalize to ol' U+0021. Nonetheless, I suspect that
the exclamation mark's origin was in the use of ASCII for the otherwise
unrepresented sounds and that the "should" in your note remains at least
somewhat unrealized. A brief Googling of Khoisan produces pages that use !,
#, //, and ' for the clicks encoded by U+01c0->U+01c3, including the Rosetta
Project page which is encoded as UTF-8 (!!), but uses the ASCII characters,
not the specially encoded variants cited.

Of course, none of the sites I searched was actually IN one of these
languages. Every one that I saw was in English (one had a link to an
Afrikaans page). Perhaps the various Khoisan peoples who have web pages are
using the Unicode characters in question. But the likely prevalence of
English (or at least Western European) keyboards and systems probably has
encouraged the widespread non-adoption of the correct characters (hence,
this may be the example that proves the rule, although I can't think of
anything else that looks more like a click than a bang or an octothorpe ;-).

Best Regards,

Addison






RE: (long) Making orthographies computer-ready (was *not* Telephoning Tamil)

2002-07-29 Thread Addison Phillips [wM]

I know, hence the jocular tone with wink-and-smile. You are much more likely
to get people's attention if you have a by-god-two-letter code than if you
don't. (Today) you just can't ignore the perception that two-letter codes
are somehow "legit" and three-letter codes somehow aren't... and that too
many locale structures are based explicitly on the two-letter flavor.

On the other hand, I suspect that the two-letter dogma is more past-history
than actual technical requirement. For example, there are real Solaris
locales with names like "japanese". Java allows you to ask for/construct a
locale with any pair/trio of strings (said locale doesn't have any meaning,
since you can't populate the data files). And so on. Just because no one
makes locales using 3-letter codes doesn't mean it is technically
impossible. (But it doesn't mean that there is no restriction either.)

Of course, I understand why a company might make a business decision not to
make and support a locale for a language that doesn't qualify for a
two-letter code. Lack of compelling business reasons to build, change, or
test support for minority languages is more a limiter here probably than
active engineering work preventing it.

Addison

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of [EMAIL PROTECTED]
> Sent: Monday, July 29, 2002 3:02 PM
> To: [EMAIL PROTECTED]
> Subject: Re: (long) Making orthographies computer-ready (was *not*
> Telephoning Tamil)
>
>
>
> On 07/29/2002 03:56:36 PM "Addison Phillips [wM]" wrote:
>
> >Nonetheless, if you glance at the "SpecialCasing" file in Unicode, you will
> >note that almost without exception the entries are locale driven. The first
> >stop in creating a new orthography (or computerizing an existing one, perhaps
> >from the days of the typewriter), for my money would probably be to get ISO-639
> >to issue the language a 2-letter code so you can have locale (and Unicode
> >character database) data tagged with it ;-).
>
> OK, now you've hit a hot button: The industry needs to wake up to the fact
> that the requirement that a language have an ISO-639 2-letter
> code before a
> locale can be created is a dead end. There just aren't enough 2-letter
> codes to go around, and ISO 639-2 has restrictive requirements for doling
> out 2-letter codes -- it wasn't created for the benefit of locale
> implementers, but for the benefit of terminologists. Luiseño and Tongva
> simply are not candidates. This very issue was raised with the
> relevant ISO
> committee in relation to Hawaiian: a 2-letter code was requested
> specifically because someone was trying to get a Unix implementation
> developed and was told by the engineers that it couldn't be done
> without an
> ISO 2-letter code. Well, I'm pretty sure Hawaiian isn't going to get it,
> because it doesn't meet the requirements for ISO 639-1. Instead of asking
> for a 2-letter code, the engineers should have been looking at what it
> would take to make the software support a 3-letter code (which already
> exists in ISO 639-2).
>
>
>
> - Peter
>
>
> --
> -
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <[EMAIL PROTECTED]>
>
>
>
>
>





RE: REALLY *not* Tamil - changing scripts (long)

2002-07-29 Thread Addison Phillips [wM]

Keld wrote:
> > It's *much* easier -- and, in the long term, safer -- for them to
> > select from the extensive inventory of characters available in
> Unicode and
> > to avoid using ASCII punctuation characters with redefined word-building
> > semantics.
>
> I don't get what you are saying here, why should people be limited to
> ASCII punctuation characters? With GNU libc you can declare your own set
> of punctuation characters in the locale, and they can be any 10646
> character. Or are you referring to the specific locale syntax from
> POSIX/TR 14652?
>
I think what Peter is saying is that, although you CAN create an orthography
that uses any combination of stuff, it is a bad idea to ignore the Unicode
character properties and use whatever comes to hand (like punctuation
symbols).

Yes, you can program a single C program (or Java program, or what-have-you)
to know how to process your text. But you still face the enormous amount of
software that *doesn't* understand. In the ideal world, you can use your
orthography in Microsoft Word (or StarOffice if you prefer) and not have the
grammar checker destroy your text automagically. Building an orthography
that recognizes this makes more sense than having to write "TongvaWord" and
"TongvaExcel" and "TongvaMail" and so on.

Programs that run on platforms that have user-defined locales can get some
of this from providing a locale to use (and switching to it), but there is
always the risk that a programmer has taken a "shortcut" somewhere and is
looking for @ or ! or whatever (for example, if I type an @ in certain
programs surrounded by ASCII text, the program will convert it to a mailto:
hyperlink--yuck!).

In short, the question isn't whether something is or isn't possible, but
rather whether something is or isn't a good idea or desirable. If you've had
an orthography since 1937 based on locally available typewriters, then you
probably won't want to change. If you have NO writing tradition, you're
better off avoiding unnecessary headaches, IMO.

Addison





(long) Making orthographies computer-ready (was *not* Telephoning Tamil)

2002-07-29 Thread Addison Phillips [wM]

There are always consequences...

... but I am saying that you could build a locale that would work. Generally speaking, 
most programming environments do not look at the Unicode character database for the 
operations in question, or at least, don't look directly at those tables. They use 
custom generated tables or code. For example, from what I know of Java's internal 
structure, it would be relatively easy to construct the necessary classes.

For example, you can create a rule string for RuleBasedCollator that does collation of 
@, since the collator doesn't look at the character properties when performing sorting 
(normalization is another matter, though). A BreakIterator can be fashioned that 
doesn't break on the @ character. Localized strings (as in DateFormat's list of month 
names, for example) are just strings. And so on.
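
For instance, a minimal sketch of tailoring "@" into a collation (the rule
shown is illustrative, not a recommendation):

import java.text.Collator;
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.Locale;

public class AtSignCollation {
    public static void main(String[] args) throws ParseException {
        RuleBasedCollator base =
            (RuleBasedCollator) Collator.getInstance(Locale.ENGLISH);
        // '@' is special in the rule syntax, so it is quoted; this rule makes
        // it sort as a letter immediately after 'a' instead of as a symbol.
        RuleBasedCollator tailored =
            new RuleBasedCollator(base.getRules() + " & a < '@'");
        System.out.println(tailored.compare("@pple", "apple") > 0); // true
    }
}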

The consequences would generally come into play when you encounter code that DOES look 
at Unicode properties (or looks at a table that is not locale-driven). You'll get 
transient failures in that case.

IOW> the Unicode properties are not just guides. Building "complete Unicode support" 
means taking all the special cases and special pleading into account. Creating a new 
orthography for a minority language should probably take this into account, since what 
one is doing in a small, insular community may be ignored or resisted by Unicode 
implementers, especially if the result cannot be easily fit into existing support 
mechanisms.

The best course of action, if you have the freedom to pursue it, is to choose 
characters that have properties similar to those of the orthographic unit you are 
mapping. "@" has lots of problems (it isn't legal as a "word-part" in a URL, for 
example), it is identified as punctuation (so code that doesn't know about your locale 
may word- or line-break on it), it has no case mapping (so you're at the mercy of 
SpecialCasing, etc.). It is likely that any special cases that you create for ASCII 
characters will be more of an annoyance for Unicode implementers and thus tend not to 
be supported. Avoiding the creation of special cases is a Good Idea.

There are, of course, several orthographies, some with quite large speaker 
populations, that have this potential issue. One that occurs to me might be the 
Khoisan languages of Africa, which I believe commonly use "!" (U+0021) for a click 
sound. This is almost exactly the same problem you are describing for Tongva.

Nonetheless, if you glance at the "SpecialCasing" file in Unicode, you will note that 
almost without exception the entries are locale driven. The first stop in creating a 
new orthography (or computerizing an existing one, perhaps from the days of the 
typewriter), for my money would probably be to get ISO-639 to issue the language a 
2-letter code so you can have locale (and Unicode character database) data tagged with 
it ;-).

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)  
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature. 



> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Curtis Clark
> Sent: Friday, July 26, 2002 11:23 PM
> To: [EMAIL PROTECTED]
> Subject: Re: REALLY *not* Tamil - changing scripts (long)
> 
> 
> Addison Phillips [wM] wrote:
>  > Obviously I'm not an expert in these linguistic areas (and hence
>  > rarely comment on them), but it seems to me that the lack of other
>  > mechanisms makes Unicode an attractive target for criticism in this
>  > area.
> 
> Certainly no Unicode-bashing was intended (I'm more of a Unicode 
> evangelist). I guess I'm confused about the use of Unicode character 
> properties. Are you saying that, even though Unicode defines U+0027 as 
> punctuation, other, I could use it as a glottal stop and create a locale 
> that would treat it as a letter (and still be "Unicode compliant", 
> whatever that is?). And if that's the case, are the Unicode properties 
> just guides? Could I develop an orthography where Yßяبձ⁋ would be a 
> word, and there would be no consequences?
> 
> -- 
> Curtis Clark  http://www.csupomona.edu/~jcclark/
> Mockingbird Font Works  http://www.mockfont.com/
> 
> 
> 
> 





RE: REALLY *not* Tamil - changing scripts (long)

2002-07-27 Thread Addison Phillips [wM]

Nothing wrong with it, hence "approximately" ;-)

My hyperbole was getting in the way of accuracy there, as I'm aware of the
Linux locale thing (for that matter, most POSIX systems will let you compile
a locale with a fair bit of effort, and JDK1.5 is rumored to possibly
include user-def locales, so heck...)

Addison

> -Original Message-
> From: David Starner [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, July 27, 2002 12:07 AM
> To: Addison Phillips [wM]; Curtis Clark
> Cc: [EMAIL PROTECTED]
> Subject: Re: REALLY *not* Tamil - changing scripts (long)
>
>
> At 08:46 PM 7/26/02 -0700, Addison Phillips [wM] wrote:
> >That does leave you with the much less happy problem of finding
> a platform
> >with user defined locales (approximately no platforms
> conveniently do this).
>
> What's wrong with Linux's user defined locales? I attach one
> in actual use; while the format isn't the clearest and simplest
> in the world, any computer geek could fix you up a new locale
> given a few hours. You could write a nice GUI program to simplify
> the matter, but it's not something that many users are interested
> in, especially given Unix's flexibility at mixing features from
> different locales.
>
> --
> David Starner - [EMAIL PROTECTED]





REALLY *not* Tamil - changing scripts (long)

2002-07-26 Thread Addison Phillips [wM]

I dunno, Curtis. This sounds less like a job for Unicode and more like a job for other 
mechanisms, such as user-defined locales.

Granted that keyboarding is a pain if you choose a character collection that is not 
represented by a convenient keyboard. But the real issues appear to be mostly in 
linguistically related processing (like word breaking, sentence breaking, collation, 
and the like). In most cases these are not something that Unicode per-se can help 
with, but which user-defined locale data could.

Let's take the putative Tongva @ letter as an example. If I had to create a locale in, 
say, Java for it, I could create special casing information (if @ has case), a 
collation table, breaking tables, and the like and nail most of the issues that you 
have. Even loading a "spell checker" is really a locale- or language-related problem 
in most systems today. The main problem would be if you were using @ but actually 
MEANT û or some such. E.g. the Klingon problem, but with a real language.
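
As a sketch of the default behavior such a locale would have to override (the
sample "word" is invented):

import java.text.BreakIterator;
import java.util.Locale;

public class DefaultBreaks {
    public static void main(String[] args) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.US);
        words.setText("xa@wen");
        for (int b = words.first(); b != BreakIterator.DONE; b = words.next()) {
            System.out.print(b + " "); // prints "0 2 3 6 ": the '@' splits the word
        }
    }
}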

When that's the case, then you have a case for encoding a new character. But the 
escaping mechanisms in Unicode, like SpecialCasing, seem ample enough to handle 
minority languages like these in all cases where you are just creating an orthography 
using an existing writing system's bits and pieces. It's not like Unicode has defined, 
say, "vowelness" or pronunciation.

IOW> If you have a new character that needs encoding, then the UTC can probably be 
cadged into encoding it. If you are using existing encoded characters from another 
writing system, then there is nothing to do >>in Unicode<< except note the exceptional 
use of those characters.

That does leave you with the much less happy problem of finding a platform with user 
defined locales (approximately no platforms conveniently do this).

Obviously I'm not an expert in these linguistic areas (and hence rarely comment on 
them), but it seems to me that the lack of other mechanisms makes Unicode an 
attractive target for criticism in this area.

Best Regards,

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)  
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature. 

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Curtis Clark
> Sent: Friday, July 26, 2002 6:46 PM
> To: [EMAIL PROTECTED]
> Subject: *not* Tamil - changing scripts (long)
> 
> 
> James Kass wrote:
> > Isn't this kind of a Catch-22 for anyone contemplating script reform?
> > Do we discourage people from altering their own scripts?  Should we?
> > It is suggested that scripts can be "alive" in the same sense that
> > languages are "alive"; changes (which are part of life) just occur
> > much more slowly in scripts.
> 
> This touches on some "Unicode vs. the world" issues I've been thinking 
> about, having to do with indigenous peoples developing orthographies for 
> their own languages.
> 
> My two examples are both languages of the Takic group in southern 
> California. The Luiseño language declined to a very few native speakers, 
> but has enjoyed a renaissance in recent years. The Gabrieleno (Tongva) 
> language was effectively extinct—no native speakers, no recordings, some 
> amount of written documentation—but the Tongva are resurrecting it (it 
> is similar enough to the other Takic languages that it is possible to 
> reconstruct parts that are missing).
> 
> Anthropological accounts of both languages are of course in the phonetic 
> alphabets beloved by linguists in the days before IPA stabilization. 
> And, like many other native Americans, the Luiseño and Tongva have 
> wanted simpler orthographies that can be typed with US-English keyboards.
> 
> I don't have a lot of familiarity with Luiseño, but web pages have 
> included passages where non-letters (such as @) are used as letters. 
> This solves the keyboarding problem (since few people would try to 
> pronounce an email address as Luiseño), but I imagine all sorts of 
> issues with sorting, searching, word selection, casing, and all the 
> other sorts of things that computers can do for "major" languages.
> 
> Where all this involves me is with Tongva. I have been working with a 
> Tongva ethnobotanist on a project that, among other things, involves 
> plant labels in Tongva, English, and Latin. Tongva spelling is currently 
> inconsistent, and my colleague has been regularizing it for this project 
> (because he is the primary language teacher for the nation, and few have 
> any fluency at all, he has this freedom). Somewhat like English, Tongva 
> represents both the "oo" and "uh"  sounds both by "u". Unlike English, 
> the rest of the orthography provides no clues to which sound is meant.
> 
> /If/ my colleague were to ask (and the Tongva may be satisfied with the 
> existing orthograph

RE: User interface for keyboard input

2002-07-18 Thread Addison Phillips [wM]

Hi Martin,

I install the Chinese Unicode keyboard myself...

When I confronted this specific problem recently in our products, the main
solution I adopted was to allow \u notation as input (of course, our
products are for developers...)

Hope that helps.

Addison

Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Martin Kochanski
> Sent: Thursday, July 18, 2002 2:44 AM
> To: [EMAIL PROTECTED]
> Subject: User interface for keyboard input
>
>
> I'm working on Unicode-enabling a database product for Windows.
> This obviously includes making it possible for a user to type
> arbitrary Unicode characters, so I thought it might be a good
> idea to ask people on this list about the input methods that they
> found most intuitive. Quite apart from your theoretical insights,
> many of you will have much more experience than I have of having
> to enter exotic characters in real life.
>
> One might phrase the question like this: "If I sat you down in
> front of a program on a Windows machine, and asked you to type an
> alpha, what would you try first?". This is a question about
> intuitive expectations, so I am deliberately not specifying what
> program, nor what version of Windows, nor what keyboard -
> although we can take it for granted that you have not got a
> keyboard with Greek letters enabled, and that you do not have
> Keyman or anything similar.
>
> Incidentally, menu commands are probably not an acceptable
> solution, because if you can enter data then you must also be
> able to search for data, and searching means dialog boxes, and
> dialog boxes are not meant to have menus.
>
> The obvious thing to try is Alt+945: indeed, we have already
> implemented this; but it has the disadvantage that all available
> Unicode documentation uses hexadecimal character codes, not decimal ones.
>
> Various Windows programs offer ways of entering Unicode
> characters using hexadecimal codes, but they all seem to differ.
> Many of them use Alt+X, but there are at least three different
> and incompatible ways in which they do this (eg: is Alt+X a
> toggle or a command? In the latter case, does Alt+Shift+X invert
> it? Does it affect characters before the cursor or characters
> that you have just typed?).
>
> Much of this inconsistency is because Microsoft keep on changing
> their mind about how character entry should be done. This looks
> like evolution rather than vacillation, and it seems reasonable
> (though irritating), since the perfect solution is not always the
> first one you think of. But it does raise questions about what we
> should do to seem natural and intuitive and Just Like Any Other
> Windows Program - it is like the old question: "do you want me to
> hang this picture parallel to the ceiling or parallel to the
> floor, or do you want it horizontal?".
>
> This question has got rather long, but I thought that the more
> exact I made it, the simpler it might be to answer.
>
>
>





RE: Shift-JIS <-> Unicode?

2002-06-15 Thread Addison Phillips [wM]

http://oss.software.ibm.com/icu/userguide/conversion-data.html

You need to download ICU to get the relevant .ucm files or use the CVS link.

The files on the Unicode site are now marked obsolete (I believe that Unihan 
incorporates much of the data). Nonetheless the mappings are here:

ftp://ftp.unicode.org/Public/MAPPINGS/OBSOLETE/EASTASIA/JIS/

~Addison



> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Stefan Persson
> Sent: 2002年6月15日 8:35
> To: Addison Phillips [wM]; [EMAIL PROTECTED]
> Subject: Re: Shift-JIS <-> Unicode?
> 
> 
> - Original Message -----
> From: "Addison Phillips [wM]" <[EMAIL PROTECTED]>
> To: "Stefan Persson" <[EMAIL PROTECTED]>;
> <[EMAIL PROTECTED]>
> Sent: Saturday, June 15, 2002 4:49 PM
> Subject: RE: Shift-JIS <-> Unicode?
> 
> >Yes. There is one on the Unicode FTP site. There is also one in the ICU
> library
> (http://oss.software.ibm.com/developerworks/opensource/icu/project
> /index.htm
> l).
> 
> I can't find any such mapping table on either of the sites.
> 
> Stefan
> 
> 
> _
> Do You Yahoo!?
> Get your free @yahoo.com address at http://mail.yahoo.com
> 
> 
> 





RE: Shift-JIS <-> Unicode?

2002-06-15 Thread Addison Phillips [wM]

Yes. There is one on the Unicode FTP site. There is also one in the ICU library 
(http://oss.software.ibm.com/developerworks/opensource/icu/project/index.html).

Before you go off and get the table, though, you should note that:

1. There are several varieties of SJIS. Notably Microsoft code page 932 is not quite 
the same as what Oracle/Solaris/AIX/HP-UX call SJIS. You need to decide what you mean 
by "Shift-JIS".

2. Most programming environments have built in conversion capabilities that you can 
call and these will generally be more "natural" to use than writing your own. I say 
"natural" because the conversions will be similar to other programs running in that 
environment, providing the behavior that users expect.
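
A minimal Java sketch of point 1, using the JDK's built-in converter names
(the byte pair 0x81 0x60 is JIS X 0208's WAVE DASH, one of the well-known
points of disagreement between the variants):

import java.io.UnsupportedEncodingException;

public class SjisVariants {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] bytes = {(byte) 0x81, (byte) 0x60};
        String jis = new String(bytes, "Shift_JIS");    // JIS-style table
        String ms  = new String(bytes, "windows-31j");  // Microsoft CP932 table
        System.out.println("Shift_JIS:   U+" + Integer.toHexString(jis.charAt(0))); // 301c WAVE DASH
        System.out.println("windows-31j: U+" + Integer.toHexString(ms.charAt(0)));  // ff5e FULLWIDTH TILDE
    }
}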

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.


> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]]On Behalf Of Stefan Persson
> Sent: 2002年6月15日 6:57
> To: [EMAIL PROTECTED]
> Subject: Shift-JIS <-> Unicode?
> 
> 
> Is there any mapping table between Shift-JIS and Unicode 
> available anywhere?
> 
> 
> _
> Do You Yahoo!?
> Get your free @yahoo.com address at http://mail.yahoo.com
> 
> 
> 





RE: Running Japanese programs under Win2K

2002-06-09 Thread Addison Phillips [wM]

> 
> Small correction here: MUI is not required in order to change the default
> system locale to Japanese and keep the user interface as English 
> on a Win2K
> enu machine.

Not in a Java program ::sigh::. Some parts of Swing and AWT use the user locale, 
especially older JREs...

Addison

> 





RE: Running Japanese programs under Win2K

2002-06-09 Thread Addison Phillips [wM]

Hi Stefan,

Let's start with the easy one: Windows 2000 allows you to install any of the keyboards 
that you might need without a download. The Internet Explorer download you got earlier 
isn't necessary.

To install additional keyboards, open the "Regional Options" control panel and click 
on the last tab ("Input Locales") and add your keyboards there. While you're 
installing the Japanese IME you might want to install the Traditional Chinese 
"Unicode" keyboard, which lets you type any (UTF-16) code point value in as a hex 
sequence.

To solve some of your square box problems I would suggest that you get ahold of 
Microsoft's "Multilingual User Interface" package (MUI). I'm not sure of the licensing 
details of MUI; I get mine in my copy of MSDN. This will allow you to set the locale 
of your machine to Japanese while keeping the menus in English (or some other 
language). This mail was composed on such a machine.

The problem this will solve is the black square problem, since it installs the 
necessary system font.

For programs that are not internationalized and display bad text, this may "solve" 
your problems also, since your default windows code page will switch to 932 when you 
do this. However, you might regard programs that display ??? and mojibake text as 
"broken" and gently remind the manufacturer of the defects ;-).

Good luck,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.


> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]]On Behalf Of Stefan Persson
> Sent: 2002年6月8日 14:30
> To: [EMAIL PROTECTED]
> Subject: Running Japanese programs under Win2K
> 
> 
> Hi!
> 
> I have the English edition of Windows 2000 Professional, and I'd 
> like to run
> some Japanese programs. Some programs, e.g. Internet Explorer and MS Word,
> work properly, while other display square boxes, question marks 
> and/or text
> misinterpreted as Windows-1252.
> 
> So: how shall I solve the above problems?
> 
> And: how shall I do to type Japanese characters? IME for Japanese
> (http://www.microsoft.com/msdownload/iebuild/ime5_win32/en/ime5_win32.htm)
> doesn't work: I get a message telling me that it doesn't work with this
> version of Windows.
> 
> Thanks!
> Stefan Persson
> 
> 
> _
> Do You Yahoo!?
> Get your free @yahoo.com address at http://mail.yahoo.com
> 
> 
> 





RE: Web Form: General question

2002-05-29 Thread Addison Phillips [wM]

I believe this is covered by the FAQ located at
http://www.unicode.org/help/display_problems.html

The main issue, not to reiterate too much, is one of available fonts.
Basically there are four things that can happen when viewing Unicode text:

1. You see the character you expect to see.
2. You see multibyte trash (sometimes called "mojibake"), which is the
result of a program that is not aware of your current Unicode encoding
displaying the characters as if they were in some other encoding.
3. You see a question mark where you expect to see your character. This is
the result of a bad character encoding conversion (or more properly, the
encoding your text was converted to didn't contain the character you are
viewing).
4. You see a hollow box or black square. In this case, your software "knows"
what the character is, but "doesn't have a picture of it" to show you (that
is, your current font doesn't have this character).

On Windows there are some fonts (notably Arial Unicode MS) that can be
installed to show you nearly any Unicode character. On UNIX you are usually
tied to whatever is installed in your system. You may need to mix-and-match
fonts to see all of your characters (this is what Java tries to do and what
IE and Netscape do).

Hope that helps.

Thanks,

Addison

Addison P. Phillips
Globalization Architect
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Magda Danish (Unicode)
> Sent: Wednesday, May 29, 2002 11:58 AM
> To: unicode
> Subject: FW: Web Form: General question
>
>
>
>
> -Original Message-
> Date/Time:Mon May 27 20:58:22 EDT 2002
>
> Contact:  [EMAIL PROTECTED]
>
> Report Type:  General question
>
> Text of the report is appended below:
>
> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
>
> When I use an editor(written by Python language) in Pc(Windows 2000),
> some Unicode can be shown correctly, such as square root symbol(u221A)
> and integration symbol(u222B), but they cannot be shown on Unix(be shown
> as a square box). On the contray, some Unicode  can be shown on Unix,
> such as the Middle dot(u00B7), but cannot be shown on Pc. I don't know
> why, how can I solve this problem?
>
> -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
> (End of Report)
>
>
>





RE: To hell with Unicode ;)

2002-05-28 Thread Addison Phillips [wM]

Hmmm...

I suspect that you could search-and-replace the word "Unicode" with the word
"multibyte" or the word "Japanese" and successfully turn the clock back ten
years. The difference between then and now is that internationalization
retrofit projects are being undertaken just to get Unicode support, rather
than to satisfy specific, transient language needs. And the results are
generally better [more useful to more people] than the single-purpose
solutions of the past.

That is: developers only used to think about these things when they were
forced to internationalize---and then they often took shortcuts to support a
single, specific language.

Unicode *is* complex. But I suspect that my copy of TUS (sans cover)
actually weighs less than my copy of Lunde/blowfish.

*I* would rather face that putative horde of zombies than go back to
multibyte enabling for a living.

Best Regards,

Addison

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Roozbeh Pournader
> Sent: Tuesday, May 28, 2002 5:08 AM
> To: Unicode List; unicoRe
> Subject: To hell with Unicode ;)
>
>
>
> Quoting 'Just van Rossum' from a post on the OpenType mailing list:
>
> [...]
> Sadly, some of the funniest quotes from the Python Quotations
> Collection are
> about Unicode (from http://www.amk.ca/quotations/python-quotes/
> and beyond):
>
>   I never realized it before, but having looked that over I'm certain
>   I'd rather have my eyes burned out by zombies with flaming dung sticks
>   than work on a conscientious Unicode regex engine.
>   -- Tim Peters, 3 Dec 1998
>
>   Unicode: everyone wants it, until they get it.
>   -- Barry Warsaw, 16 May 2000
>
>   I am becoming convinced that Unicode is a multi-national plot to take
>   over the minds of our most gifted (and/or most obsessive) programmers,
>   in pursuit of an elusive, unresolvable, and ultimately,
> undefinable goal.
>   -- Ken Manheimer, 19 Jul 2001
>
>   Unicode is the first technology I have to deal with which makes me hope
>   I die before I really really really need to understand it fully.
>   -- David Ascher, 19 Jul 2001
>
> roozbeh
>
>
>





RE: Normalization forms

2002-05-13 Thread Addison Phillips [wM]

Hi Lars,

Some information below...

Addison

Addison P. Phillips
Globalization Architect
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-
Internationalization is an architecture.
It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Lars Marius Garshol
> Sent: Monday, May 13, 2002 1:38 PM
> To: [EMAIL PROTECTED]
> Subject: Normalization forms
>
>
>
> I have been reading the Unicode Normalization UTR and have a couple of
> questions regarding it:
>
>  - will string comparison methods based on NFC and NFD always give the
>same results?

The same results compared to what? If you mean:

if NFC(a) == NFC(b), then NFD(a) == NFD(b) -- the answer is yes.

If you mean:

if NFC(a) == NFC(b), then NFC(a) == NFD(b) -- the answer is no. The C and D
forms of the same string are not interchangeable.

>
>  - is it correct that methods based on NFKC and NFKD will give
>different results from ones based on NFC/NFD?

Yes. Emphatically. For example:

U+FF21 is U+FF21 in form C and does not equal U+0041.

but:

U+FF21 in Form KC becomes U+0041...
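
You can check this yourself; a minimal sketch using java.text.Normalizer,
which comes from a later JDK than this thread (ICU offered the equivalent
operation at the time):

import java.text.Normalizer;

public class FormsDiffer {
    public static void main(String[] args) {
        String fullwidthA = "\uFF21"; // FULLWIDTH LATIN CAPITAL LETTER A
        System.out.println(Normalizer.normalize(fullwidthA, Normalizer.Form.NFC).equals("A"));  // false
        System.out.println(Normalizer.normalize(fullwidthA, Normalizer.Form.NFKC).equals("A")); // true
    }
}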

>
>  - if NFC and NFD give the same results, why are both specified? Why
>would an implementation choose one over the other?

Again the question is what you mean by "results". The composed form is
actually different than the decomposed one. It is generally more compatible
with what naive rendering software expects. The decomposed form, by
comparison, makes certain kinds of processing more efficient (for example,
certain kinds of collation processing).

>
>  - NFKC/NFKD seem to lose significant information; in what contexts
>are they intended to be used?

They have a number of useful contexts. Namespaces are one. Generally
speaking, the vast majority of characters unified by the compatibility forms
are rendering differences (such as half-width forms, super/sub scripts, and
the like) which make trouble in restricted namespaces (such as programming
identifiers, domain names, and the like). In addition, it is often possible
to introspect more meaning from data input fields by applying K forms.

For example, in some of the webMethods tools GUIs, strings that do not parse
successfully as numbers on the first pass are normalized Form KC (except for
super/subscripts) in order to improve parsing success.
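
A sketch of that trick (again using the later java.text.Normalizer API; the
input string is invented):

import java.text.Normalizer;

public class FoldThenParse {
    public static void main(String[] args) {
        String input = "\uFF11\uFF12\uFF0E\uFF15"; // fullwidth "12.5"
        // Double.parseDouble(input) would throw NumberFormatException;
        // Form KC folds the fullwidth characters to plain ASCII "12.5".
        String folded = Normalizer.normalize(input, Normalizer.Form.NFKC);
        System.out.println(Double.parseDouble(folded)); // 12.5
    }
}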

>
> --
> Lars Marius Garshol, Ontopian http://www.ontopia.net >
> ISO SC34/WG3, OASIS GeoLang TChttp://www.garshol.priv.no >
>
>
>





RE: regarding unicode support in Oracle8i

2002-05-12 Thread Addison Phillips [wM]

Whups. Typo.

Try http://www.inter-locale.com/IUC19.pdf (IUC is International Unicode Conference. 
ICU is (IBM) International Classes for Unicode). 

Addison

> -Original Message-
> From: J M Sykes [mailto:[EMAIL PROTECTED]]
> Sent: 2002年5月12日 3:29
> To: Addison Phillips [wM]; [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: regarding unicode support in Oracle8i
> 
> 
> Check the link;
> For http://www.inter-locale.com/ICU19.pdf, I get:
> 
> 404 Not Found
> /ICU19.pdf was not found on this server.
> --
> --
> 
> Resin 1.1.3 -- Fri Jun 23 16:40:46 PDT 2000
> 
> 
> - Original Message -
> From: "Addison Phillips [wM]" <[EMAIL PROTECTED]>
> To: <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
> Sent: Friday, May 10, 2002 4:53 PM
> Subject: RE: regarding unicode support in Oracle8i
> 
> 
> 
> For Oracle 8/8i you will probably want to configure your database 
> to use the
> UTF8 character set. This will affect how your DDL is written and mean some
> minor tweaks to your SQL statements (you'll need to remove the "N"
> qualifiers on your strings). I have a whitepaper from the Unicode 
> Conference
> #19 at http://www.inter-locale.com/ICU19.pdf that may help you a little.
> 
> 
> 





RE: regarding unicode support in Oracle8i

2002-05-10 Thread Addison Phillips [wM]

Dear Sarada,

The thing you need to change is: don't use nchar in Oracle 8. Oracle 8 doesn't support 
the UTF-16 encoding. Oracle 9 and SQL Server provide 16-bit Unicode using the 
alternate character datatypes. In Oracle 8 the nchar/nvarchar types are used for an 
alternate character encoding, but this encoding can only be a Unicode encoding (UTF8 
or AL24UTFFSS) if the main character set of the database is already UTF8 (or 
AL24UTFFSS), which means you don't need nchar types and you still have to deal with 
the byte/char count differences.

For Oracle 8/8i you will probably want to configure your database to use the UTF8 
character set. This will affect how your DDL is written and mean some minor tweaks to 
your SQL statements (you'll need to remove the "N" qualifiers on your strings). I have 
a whitepaper from the Unicode Conference #19 at http://www.inter-locale.com/ICU19.pdf 
that may help you a little.

Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.




> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]]On Behalf Of P.S.L.Sarada Devi
> Sent: 2002年5月10日 4:12
> To: [EMAIL PROTECTED]
> Subject: regarding unicode support in Oracle8i
> 
> 
> Dear Sir/Madam,
> 
> We are trying to insert unicode chars(Hindi and Telugu) into a table in of
> data type 'nchar' in Oracle.
> We are able to do same thing in sqlserver.
> How it can be done in Oracle8i?
> Is there anything to be configured in Oracle server before doing that?
> Please mail me asap.
> Thank you all.
> 
> Regards,
> Sarada.
> 
> 
> 
> 





RE: accessing extended ranges

2002-03-27 Thread Addison Phillips [wM]

Eric wrote:

>JDK 1.4 can render characters coded as surrogate pairs. This works in AWT
>and Swing.

Yup. It renders: I just tried it in my little test environment. But...

The control still "sees" both underlying characters. You can position the
cursor in the "middle" of the supplementary character and you can delete
HALF of it (or insert stuff in the middle of it). You can cursor through or
mouse select portions of it. And the text advance is for TWO characters (as
if both surrogate values took up equal space). So it's nearly useless,
except for static display. But it is better, I guess, than the old two-box
model.

Reminds me of the old days, where non-MBCS enabled programs appeared to work
(until they did a line break in the middle of a character, etc.).

Now, clearly a sufficiently motivated coder could re-implement the various
controls to work correctly, since the rendering is actually the hard bit.
Somehow I suspect that a BugParade issue would get rejected as "Unicode
3.0++ not supported in JDK1.4", even if everyone on this list went and voted
for it ;-).

Regards,

Addison

Addison P. Phillips
Globalization Architect
webMethods, Inc.
432 Lakeside Drive
Sunnyvale, California, USA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)

Internationalization is an architecture.
It is not a feature.







RE: accessing extended ranges

2002-03-26 Thread Addison Phillips [wM]

Hi Ben,

The short answer is: you don't.

Java doesn't support characters outside the BMP (Basic Multilingual Plane) just yet. 
JDK 1.4 adds full support for Unicode 3.0, which includes a few more CJK characters, 
but not the 40,000 or so beyond U+FFFF. 

That said, you can represent the characters as UTF-16 surrogate pairs (Java's internal 
representation is UTF-16). And some of the character converters will work properly 
(notably the UTF-8 converter). But as far as Java's concerned each surrogate code 
point is a separate character.

Vexingly, the folks at Javasoft haven't said how they'll implement support for Unicode 
3.1 and later. 

ICU4J, the IBM opensource project, provides some UTF-16 support capabilities that 
suggest a possible solution, but there are seemingly intractable problems with the 
Character class and char data type (luckily most APIs in Java take int arguments for 
characters instead of char). And it is pretty easy to build classes for processing 
these characters as surrogate pairs using the Unicode character database.

The downside is that the GUI stuff, Swing and AWT, don't recognize surrogates 
properly. Paste U+D800 U+DC00 into a Swing control and you'll see TWO hollow boxes, 
not one... the JDK is rendering the characters separately. (NB> I haven't tried this 
test with 1.4, so there may be more support there for surrogates).

So, using ICU you can probably do some of the processing you're interested in. But GUI 
apps are going to be very problematic until Swing or AWT are fixed.

Hope that helps.

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Ben Monroe
> Sent: 2002年3月26日 0:17
> To: Unicode list
> Subject: accessing extended ranges
> 
> 
> I would like to access some of the characters from "CJK Unified Ideographs
> Extension B." These are all in the range of 20000-2A6DF. (direct link:
> http://www.unicode.org/charts/PDF/U20000.pdf )
> 
> "Basic Latin" appears in the 0000-007F range. The original "CJK Unified
> Ideographs" all appear within the 4E00–9FAF range. These are all easy to
> access with U+xxxx (4 x's). In Java, the format \uxxxx works just
> fine (and
> also the same for http://www.macchiato.com/unicode/ ). However, how do you
> access the characters in the larger ranges (ie, U+xxxxx or \uxxxxx)?
> 
> Directly using the 5 value format \uxxxxx produces a Unicode character
> followed by the 5th x. Here is a quick example:
> 
> public class UniStringTest {
>   static public void main(String[] args) {
> String s1 = "\u963F"; // displays fine; standard \u (4 x's)
> System.out.println(s1);
> String s2 = "\u9FA0"; // also displays fine; standard \u (4 x's)
> System.out.println(s2);
> String s3 = "\u2A6A5"; // biggest character that I know (5x's) but
> doesn't process
> System.out.println(s3);
> }
> }
> 
> I understand this isn't a programming ML, but I just used the Java program
> as an example.
> I'd appreciate some input.
> Thanks,
> 
> Ben Monroe
> 
> 
> 





RE: 16 bit unicode

2002-02-18 Thread Addison Phillips [wM]

Hi Brian,

By "16-bit Unicode" can I assume that you mean the UTF-16 (formerly UCS-2) encoding?

Depending on the programming language and environment you are using, you may already 
be using "16-bit Unicode". For example, most of Microsoft's C and VB Access drivers, 
as far as I recall, use COM, in which strings are generally represented as UTF-16, for 
example.

Oracle just added a 16-bit Unicode encoding to Oracle 9i. Whether it makes sense for 
you to use it or not is probably a more complex implementation question that I can 
explore here. I should note that Oracle 7x and 8x (and 9x for that matter) have a 
perfectly serviceable Unicode encoding based on UTF8. You may find it easier to port 
your project to Oracle using this multibyte encoding. Generally speaking, it doesn't 
matter *which* Unicode encoding you choose for your database.

Oracle's drivers support both the new 16-bit nchar/nvarchar data types and the UTF-8 
(or AL32UTF8 on Oracle 9i) encoding with traditional char/varchar2 types.Your DDL will 
vary slightly depending on which of the two encodings you choose and whether you need 
strict backward compatibility with earlier Oracle versions (Oracle added keywords to 
their DDL to help with Unicode implementation). Again, which Unicode encoding you 
choose for the actual database is an open question. However, the drivers generally 
insulate you from which encoding is used on the back end, so that from a programming 
perspective it becomes unimportant (and do note that word "generally").
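
As a rough sketch of what that insulation looks like at the JDBC level (the
table and column names are invented; the driver performs the character set
conversion between the database encoding and Java's UTF-16 strings):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class UnicodeInsert {
    static void insertName(Connection conn, String name) throws SQLException {
        PreparedStatement ps =
            conn.prepareStatement("INSERT INTO people (name) VALUES (?)");
        try {
            ps.setString(1, name); // e.g. "\u65E5\u672C\u8A9E"
            ps.executeUpdate();
        } finally {
            ps.close();
        }
    }
}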

There have been a number of presentations at recent International Unicode Conferences. 
Go back and look at IUC19 in particular for material on Oracle and Access from their 
respective companies. I also have a presentation only obliquely related, but which may 
help you with your confusion about 16- or 8-bit flavors of Unicode in databases 
(http://www.inter-locale.com/IUC19.ppt)

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.


> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]]On Behalf Of Magda Danish (Unicode)
> Sent: 2002年2月18日 10:02
> To: [EMAIL PROTECTED]
> Subject: FW: 16 bit unicode
> 
> 
> 
> From: [EMAIL PROTECTED]
> Date: 2002-02-14 05:26:42 -0800
> To: [EMAIL PROTECTED]
> Subject: 16 bit unicode
> 
> 
> 
> Hello,
> 
> I am presently investigating the use of 16bit unicode within Oracle9i
> and Access 2000 On searching the web I have not yet been able to
> successfully locate driver that supports 16 bit unicode.
> 
> Do you know of any sites or have any information of drivers for 16 bit
> unicode for Oracle9i ?
> 
> Regards
> Brian Hooker
> 
> 
> 





RE: ICU website

2002-02-09 Thread Addison Phillips [wM]

The server changed to www-124.ibm.com

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.


> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]]On Behalf Of Roozbeh Pournader
> Sent: 2002年2月9日 8:15
> To: Unicode List
> Subject: ICU website
> 
> 
> 
> Does anyone know if the web address for ICU has changed? My URL, 
> , gives me a name lookup error.
> 
> roozbeh
> 
> 
> 





RE: Unicode and Security: Domain Names

2002-02-07 Thread Addison Phillips [wM]

It is one of the competitors for internationalized domain names. The "ACE"
stands for "ASCII Compatible Encoding".

The encoding which appears likely to gain overall acceptance is called DUDE
and can be found here: http://www.i-d-n.net/draft/draft-ietf-idn-dude-02.txt

There are several ACE encoding demos on the 'Net (Mark Davis has one at
www.macchiato.com, I have one at www.inter-locale.com)

http://www.i-d-n.net is where you can find out about a whole zoo of Unicode
transfer encoding schemes proposed for use in DNS, plus the relevant issues,
of which there turn out to be a number when creating I18n domain names. The
early implementers have mostly ignored these issues and the interplay
between the ultimate standard and existing registrars should be interesting.

Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  |  The Business Integration Company
432 Lakeside Drive, Sunnyvale, California, USA
+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture. It is not a feature.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Tom Gewecke
Sent: Thursday, February 07, 2002 3:20 PM
To: [EMAIL PROTECTED]
Subject: Re: Unicode and Security: Domain Names


I note that companies like Verisign already claim to offer "domain names"
in dozens of languages and scripts.  Apparently these are converted by
something called RACE encoding to ASCII for actual use on the internet.

Does anyone know anything about RACE encoding and its properties?








RE: Arabic and Hindi digits, what to store ?

2002-01-26 Thread Addison Phillips [wM]

Dear Isam,

Generally, when storing numeric data, the answer is "neither". Use a numeric type 
(like int, long, float, etc.) and only convert the numbers at display time.

If you mean "should I change numeric characters in textual data (strings)", then the 
answer depends on your application. In most cases, it is a bad idea to change a user's 
textual data because you typically cannot recover the initial state of the data later 
(when you might need it). Users may be surprised to see their data mutating.

Instead, you can make use of the Unicode character database, digit folding, and 
normalization to perform runtime analysis of the data (for example, to retrieve the 
number value of the string).
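
For example, a minimal digit-folding sketch that reads numeric values out of
the Unicode character database via Character.digit():

public class DigitFolding {
    public static void main(String[] args) {
        String s = "\u0661\u0662\u0663"; // ARABIC-INDIC DIGITS ONE, TWO, THREE
        StringBuffer ascii = new StringBuffer();
        for (int i = 0; i < s.length(); i++) {
            int d = Character.digit(s.charAt(i), 10);
            ascii.append(d >= 0 ? (char) ('0' + d) : s.charAt(i));
        }
        System.out.println(ascii);                               // 123
        System.out.println(Integer.parseInt(ascii.toString()));  // the typed value
    }
}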

Of course, in some applications you may need/prefer to pre-process the data instead of 
preserving the original string. Or you may need to create relationships (as in a 
database) that require you to process the data in this way (so that matches match). A 
combination of digit-folding (to ASCII) and Unicode Form C normalization works pretty 
well. *Careful* processing using Form KC can also be useful sometimes (see link 
below). Again: if you're processing values that are strictly numeric, make them into 
typed objects!

The other common use of numbers is in dates: parsing the date into a date data type 
(much like I just recommended for numbers) makes a lot of sense, especially in locales 
(such as many of the Arabic locales) in which you may wish to use more than one or 
variant calendars to display the same date value.

Some useful links, especially the last:

http://www.unicode.org/unicode/reports/tr15/
http://www.w3.org/TR/WD-charreq
http://www.w3.org/TR/charmod/#sec-Normalization
http://www.w3.org/TR/1999/WD-unicode-xml-19990928/#Compatibility

I hope that helps.

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  |  The Business Integration Company
432 Lakeside Drive, Sunnyvale, California, USA
+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture. It is not a feature. 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Isam Bayazidi
Sent: Saturday, January 26, 2002 5:50 PM
To: [EMAIL PROTECTED]
Subject: Arabic and Hindi digits, what to store ?


Hi all ..
I have a quick question, we are developing several Arabic enabled software, 
and adding Arabic support to already existing ones .. and one of the issues 
that we faced is: should we store the numbers in their Hindi format or ASCII? 
we know that showing them in whatever look is a matter of preference, but 
what we are asking is: what would be the better action, to store the digits 
displayed in Hindi in their Hindi encodings, or use the Arabic digits defined 
in ASCII ( the first 128 places of ISO ) ?

-- 
Yours,
Isam Bayazidi
Amman - Jordan

 Think Linux + Think Arabic = Think www.arabeyes.org







RE: Microsoft's Japanese IME has no Unicode option

2002-01-25 Thread Addison Phillips [wM]
There is a quite simple way to do what you want:

If you want to input directly into an HTML form on the Geocities site, all
you have to do is pull down your "view" menu (I presuppose IE here) and
choose "UTF-8" from the Encoding submenu. Since Geocities doesn't send a
META tag, your browser will now encode all of the data you type as UTF-8 for
you and those are the bytes that will get stored in your page on the
back-end. The reason you're getting Shift-JIS now is that your browser is
probably set to "Japanese auto-detect" and ASCII is certainly valid
Shift-JIS.

Note that adding a META tag to your page is a very good idea if you decide
to use UTF-8 as the encoding.

You can see that this works here:
http://www.geocities.com/apphillips2000/index.html

You will note that I included a META tag. Otherwise you have to manually
select UTF-8 as the page encoding.

Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  |  The Business Integration Company
432 Lakeside Drive, Sunnyvale, California, USA
+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture. It is not a feature.


RE: Allow to type URL in Unicode

2002-01-23 Thread Addison Phillips [wM]

Hi Eric,

Domain names are a work in progress. Currently they are restricted to a subset of 
ASCII. There is a working group within the IETF working on a solution to this which 
happens to use Unicode.

As for the rest of a URL, browsers and webservers have long supported non-ASCII URLs 
(via %-encoding of the hex values of the underlying bytes/octets). Because the URL 
contains no character encoding information, the browser and server must have agreed in 
advance on an encoding, or the hex values will be misinterpreted by the server and 
(most likely) a 404 error returned.

The W3C I18N group has (strongly) recommended that browsers and servers always use the 
UTF-8 encoding of Unicode in URLs. Internet Explorer 5.0 and later automatically 
encode the characters you type into a URL using the UTF-8 encoding (except in Japan, 
Korea, and Taiwan, where the option is turned off by default and users may turn it on) 
up to the "?". So you if type a non-ASCII string into IE5's URL entry box, the server 
will receive a percent encoded UTF-8 representation of that string.
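
For example, JDK 1.4's two-argument URLEncoder.encode() produces exactly this
kind of percent-encoded UTF-8 (note that it follows form-encoding rules, so
spaces become '+'):

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class Utf8Escapes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        // u-umlaut becomes the two UTF-8 octets C3 BC, percent-encoded.
        System.out.println(URLEncoder.encode("m\u00FCnchen", "UTF-8")); // m%C3%BCnchen
    }
}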

This still leaves some issues to do with normalization, path parsing, the query bits 
after the "?", and so on. But you can make things work in most cases with a little 
care.

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  |  The Business Integration Company
432 Lakeside Drive, Sunnyvale, California, USA
+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture. It is not a feature. 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Eric Lannaud
Sent: Wednesday, January 23, 2002 9:29 AM
To: Unicode List
Subject: Allow to type URL in Unicode


Hi,

Some one kwon if it's possible to type an URL
(http://xxx.x./yyy//yyy.yyy) in a browser not in Ascii
caracters but in another caracters (unicode)?

Of course there many implications: Browser, Domain names servers, web
server. May be exist yet a working group about this subject?

Many thanks
Eric Lannaud







RE: Fun with UDCs in Shift-JIS

2002-01-17 Thread Addison Phillips [wM]

According to Lunde (p. 205), the range is 0xF040 through 0xF9FC. There are real characters in 
the range FA40 -> FC4B, at least in CP932,  which may be causing you some confusion, 
since these have concrete mappings to Unicode (not just a mapping in the U+E000 range).
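
A quick way to check, assuming your JDK's "windows-31j" table follows
Microsoft's published CP932 mapping of the UDC area onto the Private Use
Area:

import java.io.UnsupportedEncodingException;

public class UdcMapping {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] udc = {(byte) 0xF0, (byte) 0x40}; // first UDC code point in CP932
        String s = new String(udc, "windows-31j");
        System.out.println(Integer.toHexString(s.charAt(0))); // e000: start of the PUA
    }
}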

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.



> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]]On Behalf Of Lars Marius Garshol
> Sent: 2002年1月17日 7:02
> To: [EMAIL PROTECTED]
> Subject: Fun with UDCs in Shift-JIS
> 
> 
> 
> I've just discovered that it seems that Shift-JIS encodes a number of
> User-Defined Characters in the 0xF040 to 0xFCFC range, and that these
> characters are used in web pages. Does anyone know of a source of
> mappings for these characters, or even have information about what
> kinds of characters are found in this area?
> 
> Google searches found a number of mentions of this, and even one
> mapping, but none of them seemed to be usable.
> 
> Also, does anyone know of a Shift-JIS web page that uses one of these
> characters?
> 
> --Lars M.
> 
> 
> 





RE: Unicode Search Engines

2002-01-16 Thread Addison Phillips [wM]



Most search engines can search sites encoded in UTF-8. 
That isn't generally the problem. The problem is entering the data to search 
for. Most of the search pages aren't encoded in a Unicode encoding, so you can't 
enter "just any" Unicode characters to search for...
 
One exception that I know well (I did some I18N work 
for them) is AltaVista. If you go to the search page and click on "Customize 
Settings", you can set the search engine to use UTF-8 as your search encoding 
(both input and output). The back end is entirely Unicode.
 
Regards,
 
Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  |  The Business Integration Company
432 Lakeside Drive, Sunnyvale, California, USA
+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture. It is not a feature.


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Aman Chawla
Sent: Wednesday, January 16, 2002 11:49 AM
To: Unicode
Subject: Unicode Search Engines

Are there any search engines at all at present 
which allow one to search sites encoded in UTF-8? If not, are there plans to 
build such search engines? For example, is Google going to implement such an 
engine?
 
Aman Chawla


RE: How to print the byte representation of a wchar_t string with non -ASCII ...

2001-11-02 Thread Addison Phillips [wM]

Hi William,

The third case rests on how the compiler interpreted the string at compile time. 
What's the encoding of your source file? What was the encoding of the locale at 
compile time?

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture. It is not a feature. 

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Tay, William
Sent: Friday, November 02, 2001 9:38 AM
To: Unicode Mailing List
Subject: RE: How to print the byte representation of a wchar_t string
with non -ASCII ...


Dear Unicoders & C gurus,

Thank you for your comments on my previous posting. They help. Have a
question while digesting them on machine, would appreciate your help.   

At Solaris 2.6 shell prompt execute the program below by doing: 
> setenv LC_ALL en_US.UTF-8
> a.out fôó

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>

main(int argc, char* argv[])
{
   int i;
   wchar_t wstr[20];
   char mstr[20];

   setlocale(LC_ALL, "");  // char encoding is that of shell, i.e. UTF-8 

   // MB: MultiByte; WC: WideChar   
   printf("stdin in MB: %s, strlen: %d\n", argv[1], strlen(argv[1]));
   printf("Byte rep: ");
   for (i = 0; i < strlen(argv[1]); i++)
   printf("%02X ", argv[1][i]);

   mbstowcs(wstr, argv[1], 20);
   printf("stdin in WC: %ls, wcslen: %d\n", wstr, wcslen(wstr));
   // Guess this is the only way to see the byte rep of wstr string
   wcstombs(mstr, wstr, 20);
   printf("Byte rep: ");
   for (i = 0; i < strlen(mstr); i++)
   printf("%02X ", mstr[i]);

   wstr = L"fôó";
   mstr = "fôó";

   printf("App string in MB: %s, strlen: %d\n", mstr, strlen(mstr));
   printf("Byte rep: ");
   for (i = 0; i < strlen(mstr); i++)
   printf("%02X ", mstr[i]);

   printf("App string in WC: %ls, wcslen: %d\n", wstr, wcslen(wstr));
   // Guess this is the only way to see the byte rep of wstr string
   char mtemp[20];
   wcstombs(mtemp, wstr, 20);
   printf("Byte rep: ");
   for (i = 0; i < strlen(mtemp); i++)
   printf("%02X ", mtemp[i]);
}


Output:

stdin in MB: fôó, strlen: 5
Byte rep: 66 C3 B4 C3 B3

stdin in WC: fôó, wcslen: 3
Byte rep: 66 C3 B4 C3 B3

App string in MB: fôó, strlen: 3
Byte rep: 66 F4 F3

App string in WC: fôó, wcslen: 3
Byte rep: 66 C3 B4 C3 B3

-

setlocale(LC_ALL, ""); I believe instructs the program to inherit the
encoding of the shell, i.e. UTF-8 in this example. In the 3rd case above,
shouldn't the result be the same as the 1st, since the string from stdin and
the program defined var are using the same encoding scheme? 

Will



-Original Message-
From: Jungshik Shin [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 01, 2001 3:11 PM
To: Unicode Mailing List
Subject: Re: How to print the byte representation of a wchar_t string
with non -ASCII ...


[EMAIL PROTECTED] wrote:

> In a message dated 2001-10-31 10:07:44 Pacific Standard Time,
> [EMAIL PROTECTED] writes:

>> This is wrong.  wchar_t strings can of course be printed.  Reading the
>> ISO C standard would tell you to use
>>
>>   printf ("%ls", wstr);
>>
>> can be used to print wchar_t strings which are converted to a byte
>> stream according to the currently selected locale.  Eventually it has

> But won't this approach fail as soon as we hit a 0x00 byte (i.e. the
> high 8 bits of any Latin-1 character)?


   I'm not sure what you're alluding to here. As long as
all characters in wstr belong to the repertoire of the encoding/
character set of the current locale (that is, unless one
passes wstr containing Chinese characters to printf() in,
say, de_DE.ISO8859-1 locale),
there should not be any problem with using '%ls' to
print out wstr with printf(). Of course, 'printf ("%ls", wstr) '
doesn't achieve what the original question asked for, but that
question has already been answered, hasn't it?

  fprintf() man page in Single Unix Spec v2 (perhaps,
I should look at the actual C standard) doesn't seem to say anything
about what to expect
when wstr contains characters outside the repertoire of
the character set of the current locale. wcrtomb() is called
for each wide char in wstr when '%ls' is used
to print out wstr. According wcrtomb() man page,
errno is set to EILSEQ if an invalid wide char.
is given to it, but it's not clear whether
'invalid wide char' in the man page of wcrtomb() includes
valid wide chars which are NOT convertible to the encoding
of the current locale.

  Jungshik Shin







RE: Character encoding at the prompt

2001-10-24 Thread Addison Phillips [wM]

Hi William,

The answer is that it depends on the current user locale.

Generally, Western European languages in Windows use Code Page 1252 for GUI
displays and either Code Page 437 (US English) or Code Page 850 for "dos
boxes" (the "cmd" prompt). On Windows NT this can be changed manually with
the "chcp" command. Changing your actual system locale ("Regional Options")
will also change the windows and command line code pages as appropriate.
Fair warning: do NOT experiment with Asian locales on European builds of NT
4.0 systems (that you care about).  In "Microsoft-ese", the Windows code
page is the ANSI code page and the command line is the OEM code page. In
this case, ANSI has nothing to do with the standards organization or any
particular encoding---it's just a name to differentiate the code page from
the OEM flavor. There is documentation on the MS website that I am too
pressed for time to look up the URL for.

On most UNIX-like operating systems, the current locale controls the
encoding. In fact, the encoding is part of the locale name. Generally
Western European languages use ISO-8859-1 (aka Latin-1). Solaris 2.7 and
especially 2.8 add support for nifty new encodings (including UTF-8, a
Unicode encoding). If you type "locale" at the shell prompt, you will see a
listing of your various locale settings, which will include the current
encoding. Unlike Windows, the locale (and thus encoding) apply to both
command line and GUI interfaces. Also unlike Windows, the locale setting is
process specific. Child processes inherit the parent's environment, so if
you change your locale and then launch a GUI program, that program will have
a matching locale. Of course, this is a generalization.
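
One small consequence: a Java VM started from that shell inherits the same
setting. A minimal check ("file.encoding" is the standard system property
recording the default encoding the VM picked up):

public class InheritedEncoding {
    public static void main(String[] args) {
        System.out.println(System.getProperty("file.encoding"));
    }
}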

Don't forget that file systems and shells have a part to play in your
command line excursions.

Hope this helps.

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture. It is not a feature.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Tay, William
Sent: Wednesday, October 24, 2001 5:08 PM
To: [EMAIL PROTECTED]
Subject: Character encoding at the prompt


Hi,

Do you have any idea what is the default code page and encoding scheme for
MS DOS box in WinNT 4? Is there any command that can give me the info? I am
trying to input a string say "fráç" at the prompt, wondering how the
characters are encoded.

How about at the Unix (Solaris 2.6) prompt, what's the default and how to
change?

Thanks.

Will






RE: Right-to-left whitespace?

2001-10-08 Thread Addison Phillips [wM]

Hi Dennis,

Your problem is understandable, but isn't due to the properties of the
character. U+0020 ("normal space") has the bidirectional category of
whitespace. If your edit control worked properly, the insertion of a
whitespace character would not, in and of itself, change the directionality
of the text in the document you are editing. In other words, it's not your
choice of characters, it's your edit control or your operating environment
(if you're using a standard control from the OS).
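
You can verify the character's class directly in a later JDK
(Character.getDirectionality() arrived in 1.4):

public class SpaceBidiClass {
    public static void main(String[] args) {
        byte d = Character.getDirectionality('\u0020');
        System.out.println(d == Character.DIRECTIONALITY_WHITESPACE); // true: direction-neutral
    }
}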

See UAX#9 (Unicode Technical Report #9: Unicode Bidirectional Algorithm) for
more information here:
http://www.unicode.org/unicode/reports/tr9/#Bidirectional_Character_Types

What this essentially means is that you get the wonderful task of either
finding a replacement control that is properly internationalized or writing
one yourself (or activating the necessary internal features of your existing
control, which may not be well documented, if documented they be). Since I'm
not much of a Delphi developer, I can't help you with specifics. Your
development environment might not even support bidi applications (for all I
know). I know this has been a shortcoming of other development environments
and languages over the years: good, bad or indifferent bidi support is
usually attempted well after "multibyte" support or Unicode support because
of the relative perceived difficulty in crafting the code.

Hopefully someone else on the list can provide Delphi specific information.

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture. It is not a feature.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Dennis Bemmann
Sent: Monday, October 08, 2001 11:11 AM
To: [EMAIL PROTECTED]
Subject: Right-to-left whitespace?


Hello everybody,

I'm desperately looking for a right-to-left whitespace. I searched the
web a lot, but I couldn't find an answer, so I hope that somebody here
can help.

I am developing a Delphi application for writing Arabic texts under
non-Arabic windows. Inserting the Arabic characters in the edit
component works fine. However, there is no whitespace in the Arabic
character set, so I'm using the regular one... and this creates a big
problem:

since this space is left-to-right (unlike Arabic), the text flow is
interrupted at every point where I insert it, and words get into the
wrong order. For example, if I enter the characters A,B,C,SPACE,D,E,F
(assume they were Arabic), I expect the following on the screen: "FED
CBA". Instead I get this: "CBA FED". I understand why it comes to this
result: obviously the editor component thinks that the Arabic paragraph
ends here and appends the latin text "after" (right from) it. But this
is of course not what I want, and I suppose there must be a way without
reprogramming the whole editor.

So... I need a reverse-whitespace, or any other solution.
Thanks a lot for your help!

Dennis







RE: surrogate at java's property file

2001-10-04 Thread Addison Phillips [wM]

Carl,

Well I'm not too concerned about it. I know (heck, *you* know) the guys over
there. They've done good work to date. I don't doubt they have a solution up
their collective sleeves.

In fact, the problem is basically that no matter which path they pick
(UTF-16 or UTF-32), the Character and String class methods will have to be
changed to deal with it. At present, many of the Character class methods
(and, of course, many aspects of J2SE based on these methods) rely on a
relationship to char or a single "code unit" in a String---e.g. a UCS-2 code
unit. A "simple" solution might be to redefine Character as UTF-32 (Scalar
Value based) and keep UTF-16 for Strings... but then there are various
access methods in String that would have to be deprecated or replaced. Yuck.

In fact, I suspect (based on some evidence such as presentations at IUC17)
that the basic plumbing (data tables) is in place in 1.4. It'll be
interesting to see how they solve the problem. It's not for lack of asking
that I don't know the answer ;-)

The flip side of all this is the compatibility issue. For example,
properties files are tied to UTF-16. Interoperability with JDK 1.x products
will depend on (as far as I can tell) a UTF-16 implementation. I'm not sure
they *can* change to UTF-32 at this point. Anyhow, at this point all this
curiosity is academic. The earliest we'll see Unicode 3.1 support in Java
appears to be the next release beyond 1.4, or about a year from now. We'll
know by then what solution has been adopted. It should be interesting.

Best Regards,

Addison

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Carl W. Brown
Sent: Thursday, October 04, 2001 5:37 PM
Cc: [EMAIL PROTECTED]
Subject: RE: surrogate at java's property file


Addison,

It might be easier to convert the JVM from UCS-2 to UTF-32 so that you do
not have to worry about surrogates.  This would more closely match most Unix
implementations (except Sun) where Java is widely used.

Carl


> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Addison Phillips [wM]
> Sent: Wednesday, October 03, 2001 4:31 PM
> To: Yung-Fong Tang
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: RE: surrogate at java's property file
>
>
> No fair! You forgot to quote my disclaimer in the next email for my big
> boo-boo regarding what an int is in Java. An int is fine, darnit!
> It's char
> that was originally (at least externally) limited to 16-bits. Of course,
> many APIs use ints, which don't present a problem. But java.lang.Character
> and java.lang.String would have to change internal representation or add
> methods or something to allow surrogate pairs to be evaluated.
>
> Addison
>
> -----Original Message-
> From: Yung-Fong Tang [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, October 03, 2001 4:17 PM
> To: Addison Phillips [wM]
> Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
> Subject: Re: surrogate at java's property file
>
>
> Brian Beck:
> What do you think ?
>
> "Addison Phillips [wM]" wrote:
>
> Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's
> stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these
> versions have defined characters in the supplemental planes.
>
> In Java, a java.lang.Character object is closely tied to the definition of
> an "int", the 16-bit numeric type. Many classes and objects make no
> distinction (or worse, conflate a character with an int---many methods are
> defined to take and return ints for "Characters"). As a result, the Java
> character model appears to be tied to UCS-2 (and I don't mean UTF-16). A
> surrogate character *is* recognized to be a surrogate, but a high-low pair
> is not recognized as representing a character, nor can you retrieve the
> character properties of the matched pair.
>
> So to property files. The java.lang.Character sequence U+D800 U+DC00 is
> represented by the sequence "\ud800\udc00". This sequence does NOT represent
> U+10000. It represents TWO Characters, which happen to be surrogates that
> form a valid pair. I should point out that Java is slightly clever. For
> example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar
> value U+10000 and encodes it as a valid four byte sequence: f0-90-80-80 (and
> vice versa, of course).
>
> However, it is unclear how Unicode 3.1 support is going to make it into JDK
> 1.4++. The APIs are going to have to change to support the supplemental
> planes and the ripple effects on various APIs seem like an interesting
> problem. Perhaps they'll redefine an int to be a 32-bit value and switch
> Java to UTF-32 (yeah, sure.)

RE: Unicode characters in applet.

2001-10-04 Thread Addison Phillips [wM]



Actually, i18n.jar contains character encoding 
converters and other useful things... but nothing that will help you display 
Unicode characters: that's built-in even for the "crippled" JREs that do not 
include i18n.jar.
 
Your problem is probably font related. If you check 
out http://www.unicode.org/help/display_problems.html 
you will see information on how to turn those infuriating hollow boxes into the 
characters you expected to see the first time.
 
Addison
 

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Raghvendra Sharma
Sent: Wednesday, October 03, 2001 10:43 PM
To: [EMAIL PROTECTED]
Subject: Unicode characters in applet.
Hi all,

I have been trying to figure out ways and means to show Unicode characters
(certain Greek/mathematical symbols) in my java applet running on jre 1.3.x,
but all in vain. I cannot work out how I should start off.

Someone told me to use i18n.jar for internationalization of code. But I
couldn't get the thing exactly.

In the applet I intend to show the characters as part of a JEditorPane
component which is anyway rendering all other html content.

Can someone guide me as to how I can get my applet to show those characters.

regards,
raghav..

Raghvendra Sharma
Systems Executive, Software Solutions Business
NIIT Ltd., 8 Balaji Estate, Kalkaji, New Delhi-20
6203958

~~
"Natural ability is by far the best, but many men have succeeded in winning
high renown by skill that is the fruit of teaching."


RE: surrogate at java's property file

2001-10-03 Thread Addison Phillips [wM]

No fair! You forgot to quote my disclaimer in the next email for my big
boo-boo regarding what an int is in Java. An int is fine, darnit! It's char
that was originally (at least externally) limited to 16-bits. Of course,
many APIs use ints, which don't present a problem. But java.lang.Character
and java.lang.String would have to change internal representation or add
methods or something to allow surrogate pairs to be evaluated.

Addison

-Original Message-
From: Yung-Fong Tang [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, October 03, 2001 4:17 PM
To: Addison Phillips [wM]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: surrogate at java's property file


Brian Beck:
What do you think ?

"Addison Phillips [wM]" wrote:

> Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's
> stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these
> versions have defined characters in the supplemental planes.
>
> In Java, a java.lang.Character object is closely tied to the definition of
> an "int", the 16-bit numeric type. Many classes and objects make no
> distinction (or worse, conflate a character with an int---many methods are
> defined to take and return ints for "Characters"). As a result, the Java
> character model appears to be tied to UCS-2 (and I don't mean UTF-16). A
> surrogate character *is* recognized to be a surrogate, but a high-low pair
> is not recognized as representing a character, nor can you retrieve the
> character properties of the matched pair.
>
> So to property files. The java.lang.Character sequence U+D800 U+DC00 is
> represented by the sequence "\ud800\udc00". This sequence does NOT represent
> U+10000. It represents TWO Characters, which happen to be surrogates that
> form a valid pair. I should point out that Java is slightly clever. For
> example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar
> value U+10000 and encodes it as a valid four byte sequence: f0-90-80-80 (and
> vice versa, of course).
>
> However, it is unclear how Unicode 3.1 support is going to make it into JDK
> 1.4++. The APIs are going to have to change to support the supplemental
> planes and the ripple effects on various APIs seem like an interesting
> problem. Perhaps they'll redefine an int to be a 32-bit value and switch
> Java to UTF-32 (yeah, sure.)
>
> Best Regards,
>
> Addison
>
> Addison P. Phillips
> Globalization Architect / Manager, Globalization Engineering
> webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
> +1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
> -
> Internationalization is an architecture. It is not a feature.
>
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Yung-Fong Tang
> Sent: Monday, October 01, 2001 5:10 PM
> To: [EMAIL PROTECTED]
> Subject: surrogate at java's property file
>
> Does anyone know how Java handles surrogate pairs in property files?
>
> Java's property files use the \u encoding for non-ASCII characters, so
> U+00A5 is \u00A5. Does anyone know how they handle a surrogate pair?
>
> Is U+10000 (0xd800 0xdc00) encoded as "\u10000" or "\ud800\udc00"? (I
> think it should be \u10000.) Or can they not handle it at all?






RE: surrogate at java's property file

2001-10-01 Thread Addison Phillips [wM]

But then, it's my day to be an idiot...

Of course an int can store more than 16 bits. It's char that's defined at
0..65535 in Java. ints will work fine in the APIs. It's the chars that are
a problem.

Must be the heat. ;-)

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.


-Original Message-----
From: Addison Phillips [wM] [mailto:[EMAIL PROTECTED]]
Sent: Monday, October 01, 2001 6:24 PM
To: Yung-Fong Tang; [EMAIL PROTECTED]
Subject: RE: surrogate at java's property file


Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's
stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these
versions have defined characters in the supplemental planes.

In Java, a java.lang.Character object is closely tied to the definition of
an "int", the 16-bit numeric type. Many classes and objects make no
distinction (or worse, conflate a character with an int---many methods are
defined to take and return ints for "Characters"). As a result, the Java
character model appears to be tied to UCS-2 (and I don't mean UTF-16). A
surrogate character *is* recognized to be a surrogate, but a high-low pair
is not recognized as representing a character, nor can you retrieve the
character properties of the matched pair.

So to property files. The java.lang.Character sequence U+D800 U+DC00 is
represented by the sequence "\ud800\udc00". This sequence does NOT represent
U+10000. It represents TWO Characters, which happen to be surrogates that
form a valid pair. I should point out that Java is slightly clever. For
example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar
value U+10000 and encodes it as a valid four byte sequence: f0-90-80-80 (and
vice versa, of course).

However, it is unclear how Unicode 3.1 support is going to make it into JDK
1.4++. The APIs are going to have to change to support the supplemental
planes and the ripple effects on various APIs seems like an interesting
problem. Perhaps they'll redefine an int to be a 32-bit value and switch
Java to UTF-32 (yeah, sure.)

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture. It is not a feature.


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Yung-Fong Tang
Sent: Monday, October 01, 2001 5:10 PM
To: [EMAIL PROTECTED]
Subject: surrogate at java's property file


Does anyone know how Java handles surrogate pairs in property files?

Java's property files use the \u encoding for non-ASCII characters, so
U+00A5 is \u00A5. Does anyone know how they handle a surrogate pair?

Is U+10000 (0xd800 0xdc00) encoded as "\u10000" or "\ud800\udc00"? (I
think it should be \u10000.) Or can they not handle it at all?









RE: surrogate at java's property file

2001-10-01 Thread Addison Phillips [wM]

Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's
stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these
versions have defined characters in the supplemental planes.

In Java, a java.lang.Character object is closely tied to the definition of
an "int", the 16-bit numeric type. Many classes and objects make no
distinction (or worse, conflate a character with an int---many methods are
defined to take and return ints for "Characters"). As a result, the Java
character model appears to be tied to UCS-2 (and I don't mean UTF-16). A
surrogate character *is* recognized to be a surrogate, but a high-low pair
is not recognized as representing a character, nor can you retrieve the
character properties of the matched pair.

So to property files. The java.lang.Character sequence U+D800 U+DC00 is
represented by the sequence "\ud800\udc00". This sequence does NOT represent
U+10000. It represents TWO Characters, which happen to be surrogates that
form a valid pair. I should point out that Java is slightly clever. For
example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar
value U+10000 and encodes it as a valid four byte sequence: f0-90-80-80 (and
vice versa, of course).
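
For what it's worth, both halves of that behavior are easy to demonstrate;
a minimal sketch (the class name is mine):

public class SurrogateBytes {
    public static void main(String[] args) throws Exception {
        // One supplementary character, U+10000, written as a pair of
        // surrogate chars. String.length() counts code units, so: 2.
        String s = "\ud800\udc00";
        System.out.println("chars: " + s.length());

        // The UTF-8 converter pairs the surrogates and emits the
        // four-byte sequence f0 90 80 80 for the scalar value.
        byte[] utf8 = s.getBytes("UTF-8");
        for (int i = 0; i < utf8.length; i++) {
            System.out.print(Integer.toHexString(utf8[i] & 0xff) + " ");
        }
        System.out.println();
    }
}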

However, it is unclear how Unicode 3.1 support is going to make it into JDK
1.4++. The APIs are going to have to change to support the supplemental
planes and the ripple effects on various APIs seem like an interesting
problem. Perhaps they'll redefine an int to be a 32-bit value and switch
Java to UTF-32 (yeah, sure.)

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3569 (mobile)
-
Internationalization is an architecture. It is not a feature.


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Yung-Fong Tang
Sent: Monday, October 01, 2001 5:10 PM
To: [EMAIL PROTECTED]
Subject: surrogate at java's property file


Does anyone know how Java handles surrogate pairs in property files?

Java's property files use the \u encoding for non-ASCII characters, so
U+00A5 is \u00A5. Does anyone know how they handle a surrogate pair?

Is U+10000 (0xd800 0xdc00) encoded as "\u10000" or "\ud800\udc00"? (I
think it should be \u10000.) Or can they not handle it at all?









CESU-8: to document or not

2001-09-17 Thread Addison Phillips [wM]

Folks,

I've been following this thread for awhile and it seems that I can make a small 
contribution.

Several comments have been made about why we should NOT document this and give it some 
kind of official imprimatur. I agree that it will generate more confusion and may be 
used in unforeseen ways by unwary people who don't take time to read the documentation.

However: the comments about this encoding being confined to the Evil Doers Who 
Practice It are faulty. Here at webMethods we have something like 90 product 
"adapters": pieces of software that talk to a specific application. As a result, I am 
aware of the vast range of variation in character set and encoding support available 
to product designers. One problem that we are approaching is that the changes to UTF-8 
(to prohibit non-shortest-form) *are* changes and that the products I work on do not 
have the option of rejecting "malformed" data. Adapters must accept the way in which 
Oracle or Peoplesoft have implemented their system (for example) and deal with it 
correctly, with a minimum loss of data.

By providing a documented, standard way to refer to legacy versions of these products 
and their encodings, I can more readily rely on having a well-documented range of 
protocols and procedures for converting and validating data exchanged with these 
systems. The argument that these products "merely support an older version of the 
Unicode standard" is specious, because the older versions merely made the six-byte 
form permissible by way of omission (the six-byte form was *never* the preferred 
form). The older versions say nothing about mixing the two forms, for example. Whether 
we dignify this encoding with a name or not, someone needs to fully document the rules 
and provide a stable basis for supporting this usage. 
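
The six-byte form is mechanical to produce: each UTF-16 code unit,
surrogates included, is pushed through the ordinary three-byte UTF-8
pattern. A sketch, not an endorsement (the helper assumes code units at
U+0800 or above, which is all a surrogate can be):

public class Cesu8Sketch {
    // Apply the three-byte UTF-8 pattern 1110xxxx 10xxxxxx 10xxxxxx
    // to a single UTF-16 code unit.
    static byte[] encodeCodeUnit(char c) {
        return new byte[] {
            (byte) (0xE0 | (c >> 12)),
            (byte) (0x80 | ((c >> 6) & 0x3F)),
            (byte) (0x80 | (c & 0x3F))
        };
    }

    public static void main(String[] args) {
        // U+10000 as a surrogate pair. CESU-8 yields the six bytes
        // ed a0 80 ed b0 80, versus f0 90 80 80 in real UTF-8.
        char[] pair = { '\ud800', '\udc00' };
        for (int i = 0; i < pair.length; i++) {
            byte[] b = encodeCodeUnit(pair[i]);
            for (int j = 0; j < b.length; j++) {
                System.out.print(Integer.toHexString(b[j] & 0xff) + " ");
            }
        }
        System.out.println();
    }
}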

For what it's worth, I thank Toby for braving the heat to produce this document. As a 
practical matter, I don't support the creation of new CESU-8 systems and will be 
grappling for a place on the walls to throw hot oil down on the barbarians who propose 
them, but for supporting our existing legacies (which cannot merely be extinguished 
"in the next release"), I think the effort is valuable. And the wording of the UTR 
seemed restrictive enough to me, at least, to be able to support the UTR (since it 
provides me the ammunition to oppose its adoption in practice).

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature. 
webMethods--THE Software Integration Company





RE: What code point is assigned for the Newton unit?

2001-09-12 Thread Addison Phillips [wM]

Hi Stefan,

Actually, you're making an understandable but incorrect assumption. The various units 
characters that look exactly like "normal" characters or sequences of characters you 
find scattered around Unicode are there for one reason only: they provide 
compatibility with existing (legacy) character sets. If you look even more closely at 
the Unicode character database, you'll find that most of these characters have 
"pointers" back to the "real" character. That's why you find most of them in blocks 
called "compatibility"---they only exist to provide backward compatibility (round trip 
conversion to and from) existing character sets and encodings. In UNICHAR.TXT, look 
at the last fields (for Normalization Form KC and KD respectively) and you'll see that 
U+212B is mapped to U+00c5. You'll also see that the "kg" sign in CJK, for example, is 
mapped to the letter "k" followed by the letter "g".

So, the short answer to your question is: the symbol for "Newton" is the letter "N" or 
U+004E, since no one saw fit to create a separate character called "newton" that 
looked just like an "N", but with a different semantic meaning prior to the creation 
of Unicode. And there should not be one created now, because the letter "N" contains 
all of the useful information necessary for that purpose.
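
If you have ICU4J handy, the compatibility mappings are easy to check for
yourself; a small sketch (the package name is from recent ICU4J releases
and may differ in older ones):

import com.ibm.icu.text.Normalizer;

public class CompatFolding {
    public static void main(String[] args) {
        // U+212B ANGSTROM SIGN maps back to U+00C5 under NFC.
        String a = Normalizer.normalize("\u212B", Normalizer.NFC);
        System.out.println("U+" + Integer.toHexString(a.charAt(0)));

        // U+338F SQUARE KG has a compatibility mapping to "kg".
        System.out.println(Normalizer.normalize("\u338F", Normalizer.NFKC));
    }
}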

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Stefan Persson
Sent: Wednesday, September 12, 2001 8:59 AM
To: [EMAIL PROTECTED]
Subject: What code point is assigned for the Newton unit?


Hi!

I recently noticed that Unicode distinguishes between the Swedish capital
letter "Å" (U+00C5) and the Ångström sign (U+212B). So it seems that every
unit sign has got its own code point, while the Latin letters with exactly
identical shape to those have other code points. For
example, the CJK Compatibility block contains some unit signs (in katakana):

㌂: anpea/Ampère
㌕: kiroguramu/kilogram
etc.

So, can someone tell me the code point for the Newton unit sign (which
looks exactly like an "N")? And can someone tell me why it's necessary to
make this distinction?

"Ångström" is spelled wrong on the code charts at Unicode's home page, BTW.

Stefan


_
Do You Yahoo!?
Get your free @yahoo.com address at http://mail.yahoo.com







RE: Adding fonts to JVM

2001-09-04 Thread Addison Phillips [wM]



Hi Sameer,
 
The "font list" in your JVM comes from the operating 
system. The list you see is what the JVM thinks you have installed locally. To 
add a font to the list, you have to install the font.
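
You can verify what the JVM sees with java.awt.GraphicsEnvironment
(available since JDK 1.2); a minimal sketch:

import java.awt.GraphicsEnvironment;

public class FontList {
    public static void main(String[] args) {
        // The list below is whatever the operating system reports;
        // there is no separate JVM font registry to append to.
        GraphicsEnvironment ge =
            GraphicsEnvironment.getLocalGraphicsEnvironment();
        String[] families = ge.getAvailableFontFamilyNames();
        for (int i = 0; i < families.length; i++) {
            System.out.println(families[i]);
        }
    }
}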
 
Best Regards,
 
Addison
 

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On Behalf Of Sameer
Sent: Tuesday, September 04, 2001 4:38 AM
To: Unicode List
Subject: Adding fonts to JVM
Hi all,

I have an applet which contains a drop down list having all fonts present
in my JVM. Now I need to add another font to this drop down. This would
mean adding the font to the JVM. How can this be achieved?

I have tried modifying the font.properties file but it does not seem to
work. Even if all content of the font.properties file is deleted, the
applet still shows the list of fonts. This probably means that the applet
does not pick the list of fonts from the font.properties file.

Does this have anything to do with the Unicode encoding of my browser?
   
Please help.

Thanks in anticipation,
Sameer Kachroo
Srijan Software Consultants


RE: japanese xml

2001-08-30 Thread Addison Phillips [wM]

That's not what he said in the responses *I* read. Perhaps I missed one on
this thread. As near as I recall, Misha wrote:

"Of course, EUC (EUC-JP in the
case of Japanese) may cover all the characters you require, in which
case there is no problem.  Additionally, if you are thinking of XML (or
HTML) then you can encode *all* Unicode characters in an EUC-encoded
document, by employing numeric character references for characters
outside the EUC character repertoire."

IOW> That's not "EUC-unicode". I don't see a mention anywhere of that
(hypothetical) encoding. That's "EUC-JP with characters outside EUC-JP
represented as NCRs", and our parser handles that quite well...

Addison

-Original Message-
From: Ayers, Mike [mailto:[EMAIL PROTECTED]]
Sent: Thursday, August 30, 2001 10:00 AM
To: 'Addison Phillips [wM]'
Cc: [EMAIL PROTECTED]
Subject: RE: japanese xml



> From: Addison Phillips [wM] [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, August 30, 2001 09:51 AM


> 4. However, you can use any other encoding, provided you tag the file
> appropriately (so that the parser knows what the encoding is and can
> translate it to its internal representation).

Slight but relevant correction:  you can use any encoding of which
the parser is aware.

> 5. You are not required to use EUC-JP for your Japanese XML
> files: you can
> use the Unicode encodings directly. In some cases, though, your file
> editing software may make it easier to work with EUC-JP (or
> Shift-JIS/Microsoft Code Page 932).

Misha was not talking about EUC-JP, rather EUC-unicode (or some name
like that), which encodes unicode scalar values using the EUC method, and
uses character references for those values (most of them) that are outside
of the EUC encoding range.  Have you tested your parser against that?


/|/|ike





RE: japanese xml

2001-08-30 Thread Addison Phillips [wM]

Hi Mike,

Perhaps I can rephrase Misha's answer ;-):

1. EUC-JP is an encoding ("charset") that was originally created to encode
Japanese character sets such as JIS X 0208 and JIS X 0212.
2. As such, EUC-JP can be used to encode the subset of Unicode that contains
all of the characters in JIS X 0208 and JIS X 0212, etc.
3. An XML parser uses the Unicode character set internally to represent and
process character data. As such, the most natural encoding to use for an XML
file would be a Unicode encoding such as UTF-8 or UTF-16.
4. However, you can use any other encoding, provided you tag the file
appropriately (so that the parser knows what the encoding is and can
translate it to its internal representation); see the sketch after this
list.
5. You are not required to use EUC-JP for your Japanese XML files: you can
use the Unicode encodings directly. In some cases, though, your file
editing software may make it easier to work with EUC-JP (or
Shift-JIS/Microsoft Code Page 932).
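
To illustrate points 3 and 4, here is a sketch using the standard JAXP
interfaces (the document content is an arbitrary sample; any parser whose
runtime knows the EUC-JP converter will behave the same way):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

public class EncodingDecl {
    public static void main(String[] args) throws Exception {
        // An XML document that declares a non-Unicode encoding and
        // contains the Japanese word "nihongo" (U+65E5 U+672C U+8A9E).
        String xml = "<?xml version=\"1.0\" encoding=\"EUC-JP\"?>"
                   + "<doc>\u65E5\u672C\u8A9E</doc>";
        byte[] bytes = xml.getBytes("EUC-JP");

        DocumentBuilder db =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = db.parse(new ByteArrayInputStream(bytes));

        // The parser reads the declaration, decodes the EUC-JP bytes,
        // and hands back Unicode character data.
        System.out.println(
            doc.getDocumentElement().getFirstChild().getNodeValue());
    }
}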

As for an XML parser that handles all of these, I know from extensive
testing that ours does. And it is worth mentioning, because, in fact,
EUC-JP (and many other encodings) are perfectly interoperable for the
subset of characters that they represent. Most XML interchanges (for
example, marketplaces such as CommerceOne or Ariba) tend to prefer that
"legacy encoded" files be converted to UTF-8 for interoperability, but there
is no requirement that one do so and many backend XML systems, *especially*
in Japan, use the non-Unicode encodings.

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.


-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of [EMAIL PROTECTED]
Sent: Thursday, August 30, 2001 8:37 AM
To: Ayers, Mike
Cc: [EMAIL PROTECTED]
Subject: RE: japanese xml



I have no idea of what you're talking about.

Misha


On 30/08/2001 16:11:14 "Ayers, Mike" wrote:
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> > Sent: Thursday, August 30, 2001 06:06 AM
>
> > IMO, I correctly replied to Viranga's question and I've
> > no idea what you're talking about below.
>
>Let me try to put it another way.  What you said may have been
> technically correct, but it was probably not worth mentioning because it
> represents a noninteroperable encoding.  Perhaps I am mistaken though - do
> you know of an XML parser that can parse the encoding that you suggested?
>
>
> /|/|ike





-
Visit our Internet site at http://www.reuters.com

Any views expressed in this message are those of  the  individual
sender,  except  where  the sender specifically states them to be
the views of Reuters Ltd.






RE: Unicode in Asia Question

2001-08-01 Thread Addison Phillips [wM]

Hi Danny,

Implementing Unicode is a good thing for creating multilingual applications
and for supporting code that is distributed worldwide (or at least to a
number of locales). Based on your questions below, you probably should start
with the Unicode FAQ (on the website) and with the standard book itself (The
Unicode Standard, Version 3.0). You might also want to look at Ken Lunde's
excellent book _CJKV Information Processing_, which explains all about
encodings used in Asia and how they relate to each other.

WRT web browsers, etc., it is more common to use the multibyte encoding of
Unicode (called UTF-8) for HTML applications. Most web development
environments support UTF-8 pretty well. Note that the encoding at the
browser says nothing about the internal processing of your system, which may
use the 16-bit encoding of Unicode (called UTF-16 and formerly called
UCS-2).

There is nothing wrong with using "legacy encodings" (which is what
Unicoders call encodings that aren't Unicode;-)) for your HTML interface.
This may make it easier for users in country to view the web pages without
adjusting their browser's font settings and preferences. Ancient browsers
(Netscape and IE 4.x and earlier) defaulted to using a font for Unicode that
only supported Latin (Western European) characters, so Asian users sometimes
would see black squares instead of their own characters unless they adjust
their settings.

It should be noted that, until Unicode 3.1 came out recently, there were a
number of characters encoded in some of the legacy encodings you cite which
were not included in Unicode. Support for Unicode 3.1 is planned for most
environments--eventually--but mostly this support is unavailable at present.
These characters that I just mentioned are generally considered quite rare,
but you should be aware of it as a potential objection that Asian users
might have to a pure Unicode approach.

Good luck with your implementation.

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature. 

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]]On Behalf Of Magda Danish (Unicode)
> Sent: Wednesday, August 01, 2001 10:34 AM
> To: [EMAIL PROTECTED]
> Subject: FW: Unicode in Asia Question
> 
> 
> 
> 
> -Original Message-
> From: NORIEGA,DANNY (A-HongKong,ex1) 
> [mailto:[EMAIL PROTECTED]] 
> Sent: Wednesday, August 01, 2001 2:56 AM
> To: '[EMAIL PROTECTED]'
> Subject: Unicode in Asia Question
> 
> 
> Hi:
> 
> My company is planning to implement 16-bit Unicode.  The 
> proposal is to
> go strictly and solely with Unicode (16 bit Unicode for Asia/Japan).
> 
> Up to this point we have specified the following encodings:
> - Big 5 (Traditional Chinese)
> - GB 2312 (Simplified Chinese)
> - Shift - JIS (Japanese)
> - KSC 5601 1967 (Korean)
> - Iso-8859-1 (Western character sets)
> - Unicode (we believe this is used for Russian)
> 
> I do not fully understand the need for the various encodings. 
>  I believe
> there are local preferences for browsers (vendors, versions, plugins,
> etc.) that are related to encoding. I have also heard there is some
> need, in Japan for example, where web users routinely view the HTML
> source and expect Shift-JIS.
> 
> Can you confirm what browser preferences (encoding driven) are user
> musts? By this I mean IE 4.0+, Netscape, Mosiac, KK Man & etc..  And,
> what are the customer needs that would make an specific encoding
> (Unicode, Big 5, GB 2312 or KSC 5601 1967) a must?
> 
> The basic question I'm trying to answer is if we move forward 
> with using
> strictly Unicode, will my customers in Asia be adversely affected by
> this decsion?  Will they not be able to view content I place on my
> website.
> 
> Best regards,
> 
> Danny Noriega
> Asia eBusiness Manager
> Agilent Technologies Hong Kong Ltd.
> [EMAIL PROTECTED]
> 
> 



RE: RTF language codes

2001-07-23 Thread Addison Phillips [wM]

Hi John,

The IDs correspond exactly to Microsoft's (proprietary) LCID ("locale ID")
codes used in the Windows operating systems. So they are a "Microsoft
standard".

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of jgo
> Sent: Monday, July 23, 2001 4:04 PM
> To: [EMAIL PROTECTED]
> Subject: RTF language codes
>
>
> I was reading:
>
> http://msdn.microsoft.com/library/specs/rtfspec_16.htm#rtfspec_34
>
> and trying to figure out the RTF language codes when I found:
> "The following table defines the standard languages used by Microsoft.
> This table was generated by the Unicode group for use with TrueType
> and Unicode.
> LanguageID (hexadecimal)   ID (decimal)
> Afrikaans   0x0436  1078
> Albanian0x041c  1052
> Arabic  0x0401  1025
> Arabic Algeria  0x1401  5121
> Arabic Bahrain  0x3c01 15361...
> Xhosa   0x0434  1076
> Yiddish 0x043d  1085
> Zulu        0x0435  1077"
>
> I don't see such a table via search from the Unicode site.
> Is this just another M$ non-standard "standard" subject to
> change at a whim?  (Does the consortium have anything to do
> with it at all?)
>
> Since I'm trying to use it with MacOS 9-, how does this fit
> with Apple's script and region codes... or does it?  (And
> where can I find those, if so?)
>
> John G. Otto Nisus Software, Engineering
> www.infoclick.com  www.mathhelp.com  www.nisus.com  software4usa.com
> EasyAlarms  PowerSleuth  NisusEMail  NisusWriter  MailKeeper  QUED/M
>Will
> program Macs for food.
>
>
>
>





RE: Is there Unicode mail out there?

2001-07-11 Thread Addison Phillips [wM]

I think you'll find that Peter's response applies to you too: the mailer is seeing 
Windows-874 on the incoming message and converting your outgoing message to use that 
same encoding (in a bid to be compatible with the original message). Outlook has done 
that for awhile. If you manually set the encoding for the reply you can override that 
behavior. In Outlook 2000 this is "Format | Encoding"

Best Regards,

Addison

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]]On Behalf Of Ayers, Mike
> Sent: Wednesday, July 11, 2001 12:42 PM
> To: Unicode List
> Subject: RE: Is there Unicode mail out there?
> 
> 
> 
>   Okay, I sent these as UTF-8, with some Chinese where 
> the question
> marks are.  However, the Chinese is getting eaten somewhere 
> along the way.
> Oddly, though, the Thai still displays fine.  Would any 
> Outlook XP guru
> volunteer to help me get back to my international ways?
> 
>   Final test:  
> 
> 
> > From: Ayers, Mike [mailto:[EMAIL PROTECTED]] 
> > 
> > Let's try this again...
> > 
> > > > From: Mark Davis [mailto:[EMAIL PROTECTED]] 
> > > > 
> > > > Yes, that works fine. The Thai comes through clearly: 
> > กลัปมาอยู่แล้ว
> > > > 
> > 
> > Woohoo!!!  UTF-8 party!!!  ???!!!
> > 
> > > 
> > > /|/|ike
> > > 
> > 
> 
> 





RE: MIME standards(RFCs) (was..RE: Unicode, UTF-8.....)

2001-07-11 Thread Addison Phillips [wM]

You're right. I just have those numbers memorized from long usage ::sigh::

Addison

> -Original Message-
> From: [EMAIL PROTECTED] 
> [mailto:[EMAIL PROTECTED]]On Behalf Of Jungshik Shin
> Sent: Wednesday, July 11, 2001 12:27 PM
> To: Unicode Mailing List
> Subject: MIME standards(RFCs) (was..RE: Unicode, UTF-8.)
> 
> 
> On Tue, 10 Jul 2001, Addison Phillips [wM] wrote:
> 
>  Let me just point out a little glitch in otherwise excellent 
> answer :-)
> 
> 
> > successfully using Latin-1 even though the message body of 
> a message you
> > compose would normally be in UTF-8. See RFC 1341 and 1342 
> for the details on
> > how such stuff is labeled.
> 
>   Although overall design of MIME standard  remained the same,
> there have been some evolution and I guess it's better to give the
> references to the up-to-date RFCs :-).  IETF RFC 1341/1342 have gotten
> obsoleted by RFC 1521/1522/1590 which have been superseded, in turn,
> by RFC 204[5-9] and RFC 2184/2231/2646 (all of these are available
> at <http://www.ietf.org>)
> 
>   Jungshik Shin
> 
> 
> 




RE: Is there Unicode mail out there?

2001-07-11 Thread Addison Phillips [wM]

After all the various replies that say "gosh, I can't read this", I thought
it might be helpful to point out this section of Abhijit's email headers:

Content-type: text/plain; charset=us-ascii

The outbound mailer (even in Notes, which is a pretty well internationalized
application, although they bury the settings that control this specific
capability!!) can send UTF-8, as far as I remember, plus a raft of legacy
encodings. In this case either the user's mail client or the mailer itself
is set to send US-ASCII. Since I don't have Notes installed these days, I
can't say where the controls are that change the settings (I certainly don't
remember), but I do recall that I was able, as a Notes user in the past, to
set my encoding. That would quite possibly make the string of eight unknown
characters visible to the list. Note that this has nothing to do with which
mailer you are receiving the mail with or with Sarasvati's capabilities or
anything: the message was converted to nothing before it left the sender.

In most cases in my recent experience, settings on the mailer or mail client
itself prevent a proper Unicode message from being generated. The mailers
themselves rarely care about the encoding: as long as it obeys RFCs
822/1341/1342 they are happy. Most of the more modern GUI mail clients can
handle UTF-8. Yes, there are older or text-mode clients that can't deal with
it, but in my experience it is getting to the point that there are
(generally, generally) more problems with getting the settings set to send
than with receivers receiving!

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.

> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]]On Behalf Of [EMAIL PROTECTED]
> Sent: Wednesday, July 11, 2001 3:11 AM
> To: Otto Stolz
> Cc: Unicode List
> Subject: Re: Is there Unicode mail out there?
>
>
> Can you read this? This is coming from Lotus Notes.
>
> 
>
>
> Otto Stolz <[EMAIL PROTECTED]> on 07/11/2001 10:43:10 PM
>
> Please respond to Otto Stolz <[EMAIL PROTECTED]>
>
> To:   Unicode List <[EMAIL PROTECTED]>
> cc:   11 <[EMAIL PROTECTED]> (bcc: Dutta Abhijit/India/IBM)
> Subject:  Re: Is there Unicode mail out there?
>
>
>
>
> <[EMAIL PROTECTED]> had asked:
> > Is there Unicode mail out there?
>
> On Sun, 8 Jul 2001 03:40:51 -0700 James Kass wrote:
> > Microsoft's Outlook Express offers many e-mail encoding
> > options, including Unicode (UTF-8) and responding to the
> > sender in the same encoding as the sender's message.  And,
> > it won't cost you money.
>
> Same with Netscape 6.01; though it still has some teething problems.
>
> Best wishes,
>   Otto Stolz
>
>
>
>
>
>





RE: Unicode, UTF-8 and Extended 8-Bit Ascii - Help Needed

2001-07-10 Thread Addison Phillips [wM]

Hi Stephen,

The short answer to your question is "no". The characters between U+0080 and
U+00FF *are* supported by UTF-8 (all Unicode characters are supported by
UTF-8), but they are not represented by the same byte values as in Latin-1.
If they were, the result would simply *be* Latin-1, and there would be no
way to represent the other 1.1 million potential code points in Unicode
;-)

UTF-8 is (7-bit) ASCII compatible, so an ASCII character is itself in UTF-8.
However, all other characters in UTF-8 are represented by a two-, three-, or
four-byte sequence. So the Latin-1 characters in Unicode (above 0x80) are
all represented by two byte sequences.
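
A two-line experiment makes the difference concrete; a minimal sketch (the
class name is mine):

public class Latin1VsUtf8 {
    public static void main(String[] args) throws Exception {
        String s = "\u00E9"; // e-acute, a Latin-1 character above 0x80
        dump("ISO-8859-1", s.getBytes("ISO-8859-1")); // one byte:  e9
        dump("UTF-8     ", s.getBytes("UTF-8"));      // two bytes: c3 a9
    }

    static void dump(String label, byte[] bytes) {
        System.out.print(label + " :");
        for (int i = 0; i < bytes.length; i++) {
            System.out.print(" " + Integer.toHexString(bytes[i] & 0xff));
        }
        System.out.println();
    }
}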

Now, you might notice two things about your problem.

First: if you pass UTF-8 through a system that expects Latin-1 (and which
will tolerate characters in the C1 control range between 0x80 and 0x9F), you
can usually pass the UTF-8 through and recover it on the "far end". In fact,
this was one of the original design goals of UTF-8.

Second: the reverse is *not* true. It is extremely unlikely, due to the very
specific patterning in UTF-8, that a UTF-8 system will pass Latin-1 cleanly.
This is the situation that you describe below.

Luckily, you can probably still pass your EDIFACT documents successfully
even though your mailer is being converted to use UTF-8 as a default. That's
because your EDI documents are likely to be file attachments and each one
can have its own Content-Type header. Your mailer will be applying a
Transfer-Encoding Scheme (think "base64") to your document to make it 7-bit
clean anyway. As long as your code labels the content with its correct
encoding (by calling the API correctly) you can transfer the document
successfully using Latin-1 even though the message body of a message you
compose would normally be in UTF-8. See RFC 1341 and 1342 for the details on
how such stuff is labeled.

Best Regards,

Addison

===
Addison P. Phillips  Manager, Globalization Engineering
webMethods, Inc.Globalization Architect
+1.408.962.5487 (tel.)  mailto:[EMAIL PROTECTED]
+1 408.210.3569 (mobile)  +1 408.962.5329 (fax)
===
"Internationalization is not a feature. It is an architecture."

-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
Behalf Of Stephen Cowe - Sun Scotland
Sent: Tuesday, July 10, 2001 3:53 AM
To: [EMAIL PROTECTED]
Subject: Unicode, UTF-8 and Extended 8-Bit Ascii - Help Needed


Hi Unicoders,

I am new to the list and would be really grateful if you could help me out
here.

I am trying to discover if the "extended Latin" 8-bit ASCII characters
(decimal values 128-255, hex 80-FF), i.e. ISO-8859-1, are supported by
UTF-8, and if so, whether the values are the same.

The reason why I am asking this is because our EDIFACT EDI system needs to
send extended Latin European characters (using the UNOC version 3 syntax
identifier) and our global internal messaging system is being converted to
UTF-8.

I have had a good search of the Unicode web-site but do not seem to be able
to find the answer, yes or no, that I require.

I look forward to hearing from you, kind regards,

Stephen Cowe.

eCommerce Technologist
GSO IT EDI/EDE
+44 (0)1506 672541 (Tel)
+44 (0)1506 672893 (Fax)
[EMAIL PROTECTED]







RE: Normalization and the sample code

2001-06-13 Thread Addison Phillips [wM]

"Never Mind"

I found that recompiling the data tables for ICU4J fixed the problem. It wasn't very 
helpful that I had a slightly older ICU4J. The one they released tonight is fine.

This does mean that the sample Java applet on the Unicode website has this problem and 
ought to be corrected. The tables and charts, of course, are correct, but it took a 
lot of spelunking to figure out that I wasn't crazy and #2 below was what it was.

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
"Our opportunity is to integrate up to 75% of the world's economy. Is anyone excited 
yet?" 
--- Phillip Merrick, CEO webMethods.

> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
> Behalf Of Addison Phillips [wM]
> Sent: Wednesday, June 13, 2001 3:15 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Normalization and the sample code
> 
> 
> All,
> 
> I have been playing with the sample code for normalization in 
> UAX15 and the ICU4J classes that are, shall we say, "closely 
> related" to the sample code.
> 
> If I ask for NFKC of the string U+0060 or U+005F (or of 
> U+FF40 and U+FF3F, which are the wide equivalents and the 
> initial source of my woes), I get the sequence U+0020 U+0300 
> (or U+0020 U+0332). The wording of the UAX implies that this 
> is the "correct" behavior, as long as you don't consider the 
> non-spacing marks to be a "combination" of space and the 
> non-spacing version of the character.
> 
> The conformance test file says that FF40 and FF3F should 
> become 0060 and 005F, but nothing about 0060 and 005F 
> ultimately. Neither does it handle 0020 + 03xx in any way.
> 
> So, what's right?
> 
> 1. I should get the sequence I get; or
> 2. There is a bug in the code; or
> 3. There is an omission in the tables.
> 
> Best Regards,
> 
> Addison
> 
> Addison P. Phillips
> Globalization Architect / Manager, Globalization Engineering
> webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
> +1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
> -
> Internationalization is an architecture. It is not a feature. 
> 
> 
> 





Normalization and the sample code

2001-06-13 Thread Addison Phillips [wM]

All,

I have been playing with the sample code for normalization in UAX15 and the ICU4J 
classes that are, shall we say, "closely related" to the sample code.

If I ask for NFKC of the string U+0060 or U+005F (or of U+FF40 and U+FF3F, which are 
the wide equivalents and the initial source of my woes), I get the sequence U+0020 
U+0300 (or U+0020 U+0332). The wording of the UAX implies that this is the "correct" 
behavior, as long as you don't consider the non-spacing marks to be a "combination" of 
space and the non-spacing version of the character.

The conformance test file says that FF40 and FF3F should become 0060 and 005F, but 
nothing about 0060 and 005F ultimately. Neither does it handle 0020 + 03xx in any way.
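
For reference, this is roughly the check being described, assuming ICU4J's
Normalizer (with corrected data tables, U+FF40 should come back as the
single character U+0060 per the conformance file):

import com.ibm.icu.text.Normalizer;

public class WideFold {
    public static void main(String[] args) {
        // U+FF40 FULLWIDTH GRAVE ACCENT under NFKC.
        String result = Normalizer.normalize("\uFF40", Normalizer.NFKC);
        for (int i = 0; i < result.length(); i++) {
            System.out.print("U+" + Integer.toHexString(result.charAt(i)) + " ");
        }
        System.out.println();
    }
}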

So, what's right?

1. I should get the sequence I get; or
2. There is a bug in the code; or
3. There is an omission in the tables.

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.  432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)  +1 408.210.3659 (mobile)
-
Internationalization is an architecture. It is not a feature.