RE: UNICODE BOMBER STRIKES AGAIN

2002-04-24 Thread Yves Arrouye


 You can determine that that particular text is not legal UTF-32*,
 since there would be illegal code points in any of the three forms. If
 you exclude null code points, again heuristically, that also excludes
 UTF-8, and almost all non-Unicode encodings. That leaves UTF-16, 16BE,
 16LE as the only remaining possibilities. So look at those:
 
 1. In UTF-16LE, the text is perfectly legal "Ken".
 2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".
 
 Thus there are two legal interpretations of the text, if the only
 thing you know is that it is untagged. If you have some additional
 information, such as that it could not be UTF-16LE, then you can limit
 it further.

Actually, I also think that without any external information about the
encoding except that it is some UTF-16, it *has to* be interpreted as being
most significant byte first. I agree that it could be either UTF-16LE or
UTF-16BE/UTF-16, but in the absence of any other information, at this point
in time, it is ruled by the text of 3.1 C3 of TUS 3.0 and the reader has no
choice but to declare it UTF-16.

Now what about auto-detection in relation to this conformance clause?
Readers that first try to be smart by auto-detecting encodings could of
course pick any of these as the 'auto-detected' one. Does that violate 3.1
C3's interpretation of bytes? I would say that as long as the auto-detector
is seen as a separate process/step, one can get away with it, since by the
time you look at the bytes to process the data, their encoding has been set
by the auto-detector.
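
For what it's worth, a minimal sketch (mine, untested) of that two-step
view: the detector is its own little function, honoring a BOM when there is
one and falling back to most significant byte first per C3 when there
isn't; whatever it returns is then the declared byte order of the data.

#include <stddef.h>

enum utf16_order { UTF16_BE, UTF16_LE };

/* Auto-detection as a separate step: honor a BOM if present, otherwise
   default to big-endian, per 3.1 C3 of TUS 3.0 as read above. */
static enum utf16_order detect_utf16_order(const unsigned char *b, size_t len)
{
    if (len >= 2) {
        if (b[0] == 0xFE && b[1] == 0xFF) return UTF16_BE; /* BOM, BE */
        if (b[0] == 0xFF && b[1] == 0xFE) return UTF16_LE; /* BOM, LE */
    }
    return UTF16_BE; /* no BOM: most significant byte first */
}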

YA





RE: browsers and unicode surrogates

2002-04-23 Thread Yves Arrouye

 | I am surprised by the "must only be used". It seems I am not
 | conforming by including a meta statement in the utf-16 HTML page. I
 | should either remove the statement or encode the HTML up to and
 | including that statement as ascii. I'll check on this.
 
 It doesn't make much sense to have the meta statement there, as I
 would expect most browsers to assume ASCII compatibility, but I agree
 that "must only be used" sounds too harsh.
 
 [...]
 
 it struck us: if we can see that the page claims to be UTF-16, it
 can't be, because our meta declaration scanning assumes ASCII
 compatibility.

I think you just answered why the spec says "must only be used" :) it is so
that the parsing of the meta tag can happen with predictability.
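
To illustrate (a toy sketch of mine, not any browser's actual code): a
pre-scan for the declaration can treat the first bytes as ASCII precisely
because of that restriction.

#include <string.h>
#include <strings.h> /* strncasecmp(), POSIX */

/* Toy pre-scan: look for "charset=" in the first bytes of a page,
   reading them as ASCII. This only works because the declaration, up
   to and including itself, is constrained to be ASCII-compatible. */
static const char *sniff_charset(const char *buf, size_t len)
{
    size_t i;
    for (i = 0; i + 8 <= len; i++) {
        if (strncasecmp(buf + i, "charset=", 8) == 0)
            return buf + i + 8; /* the declared name starts here */
    }
    return NULL;
}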

YA





RE: SCSU compression (WAS: RE: Thai word list)

2002-04-19 Thread Yves Arrouye

 This looks like a nice endorsement of SCSU:

:D

 It saves 59% just as a charset,
 and it saves almost 20% in a system with a real compression.

I am all for SCSU as a charset (after my tools can view it properly), but
that was not the use there. OTOH there is gzip encoding in HTTP 1.1 :)
Seriously, SCSU is fine for some uses, but in this example, was definitely
not the best way to appreciate a reduction in file size.

By the 20% you mean an additional 20% by doing SCSU+gzip versus just gzip,
right?

YA





RE: Japanese and Chinese and ... word lists (WAS RE:Thai word list)

2002-04-18 Thread Yves Arrouye

Since we're on this topic, what about sources for other languages where a
dictionary is needed to do word breaking? I'd be interested in Chinese and
Japanese myself for instance,

YA






RE: Thai word list

2002-04-18 Thread Yves Arrouye

 If you can process SCSU, and would appreciate a 59% reduction in file
 size, try:
 
 http://home.adelphia.net/~dewell/th18057-scsu.txt (135,731 bytes)

Not to knock down SCSU, but if it had been gzipped instead, the resulting
file would be about half that size: 70,912 bytes. (The gzipped SCSU-encoded
file is 57,987 bytes itself.)

YA 





RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye

  The last time I read the Unicode standard UTF-16 was big endian
  unless a BOM was present, and that's what I expected from a UTF-16
  converter.
 
 Conformance requirement C2 (TUS 3.0, p. 37) says:
 
[And many other good references where TUS does *not* say that :)]

OK, maybe in 2.0, or I made an assumption about network byte order. Or maybe
I read this too:

 I do remember reading once, somewhere, that big-endian was a preferred
 default in the absence of *any* other information (including platform of
 origin).  But I can't find anything in the Unicode Standard to back this
 up, so I'll assume for now that both byte orientations are considered
 equally legitimate.

Thanks for getting the references and checking, Doug.

YA





RE: MS/Unix BOM FAQ again (small fix)

2002-04-10 Thread Yves Arrouye

 The reason for ICU's UTF-16 converter not trying to auto-detect the BOM
 is that this seems to be something that the _application_ has to decide,
 not the _converter_ that the application instantiates.
 This converter name is (currently) only a convenience alias for "use the
 UTF-16 byte serialization that is normally used on this machine".

I agree that the application may know better. It is just unfortunate that
the name is not UTF-16PE to remind people that it is about platform
endianness (sp?). Also, when used in a script using say uconv, the script
does not have access to ucnv_detectUnicodeSignature(), so you end up in a
situation where you get a file identified as being in UTF-16 but when you
use the UTF-16 converter it may not be readable. If instead you had
UTF-16PE as the convenience name for the platform-endian UTF-16, and
UTF-16 handled the BOM and default byte order expectation (conformance
clause C3 of TUS), then it'd be much easier on newcomers.
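
For applications that do link against ICU, the dance could look like this
(a sketch, assuming ucnv_detectUnicodeSignature() behaves as its name
suggests; error handling elided):

#include <unicode/ucnv.h>

/* Detection is the application's step, conversion the converter's. */
UConverter *open_from_signature(const char *data, int32_t length)
{
    UErrorCode status = U_ZERO_ERROR;
    int32_t signatureLength = 0;
    const char *name =
        ucnv_detectUnicodeSignature(data, length, &signatureLength, &status);
    if (name == NULL) {
        name = "UTF-16BE"; /* no signature and known to be UTF-16:
                              the default argued for above */
    }
    /* The caller should skip signatureLength bytes before converting. */
    return ucnv_open(name, &status);
}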

YA





RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye

 D43 UTF-16 character encoding scheme: the Unicode
 CES that serializes a UTF-16 code unit sequence as a byte sequence
 in either big-endian or little-endian format.
 
   * In UTF-16 (the CES), the UTF-16 code unit sequence
 004D 0430 4E8C D800 DF02 is serialized as
 FE FF 00 4D 04 30 4E 8C D8 00 DF 02 or
 FF FE 4D 00 30 04 8C 4E 00 D8 02 DF or
 00 4D 04 30 4E 8C D8 00 DF 02.
 
etc., etc.

So same semantics as before. In the absence of any indication of what byte
order is used, assume big endian.

YA
 




RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye

And of course, I have been complaining about ICU's UTF-16 converter
behavior, but glibc's makes the same assumption, that UTF-16 is in the
local endianness:

gabier% echo hello | uconv -t utf-16be | iconv -f utf-16 -t ascii
iconv: illegal input sequence at position 0
gabier%

So fixing one but not the other may introduce different compatibility
problems, this time on the local platform. Ugh.

YA





RE: Default endianness of Unicode, or not

2002-04-10 Thread Yves Arrouye

  So same semantics as before.
 
 Yep. The editorial committee wouldn't be doing its job right
 if it were changing the semantics of the standard.

Agreed! Is there any mention that the non-BOM byte sequence is most
significant byte first anywhere else? You know, for the newbies?
 
 Joshua 1.8
 
 This book of the law shall not depart out of thy mouth; but
 thou shalt meditate therein day and night, that thou mayest
 observe to do according to all that is written therein: for
 then thou shalt make thy way prosperous, and then thou shalt
 have good success. (King James)
 
 --
 
 Keep this book of the law on your lips. Recite it by day and
 by night, that you may observe carefully all that is written
 in it; then you will successfully attain your goal.
 (New American Bible)

I think in this case, the change in semantics from "meditate" (which implies
reflection and intelligence) to "recite" (as I've done blindly as a student)
is either unfortunate or telling. Pick one. (Not that you can't meditate on
something you know by heart; I just think "meditate" is better.)

YA

(From Merriam-Webster, http://www.m-w.com/:)

Main Entry: med*i*tate 
Pronunciation: 'me-d-tAt
Function: verb
Inflected Form(s): -tat*ed; -tat*ing
Etymology: Latin meditatus, past participle of meditari, frequentative of
medEri to remedy -- more at MEDICAL
Date: 1560
intransitive senses : to engage in contemplation or reflection
transitive senses
1 : to focus one's thoughts on : reflect on or ponder over
2 : to plan or project in the mind : INTEND, PURPOSE
synonym see PONDER
- med*i*ta*tor  /-tA-tr/ noun

Main Entry: re*cite 
Pronunciation: ri-'sIt
Function: verb
Inflected Form(s): re*cit*ed; re*cit*ing
Etymology: Middle English, to state formally, from Middle French or Latin;
Middle French reciter to recite, from Latin recitare, from re- + citare to
summon -- more at CITE
Date: 15th century
transitive senses
1 : to repeat from memory or read aloud publicly
2 a : to relate in full recites dull anecdotes b : to give a recital of :
DETAIL recited a catalog of offenses
3 : to repeat or answer questions about (a lesson)
intransitive senses
1 : to repeat or read aloud something memorized or prepared
2 : to reply to a teacher's question on a lesson
- re*cit*er noun  




RE: MS/Unix BOM FAQ again (small fix)

2002-04-09 Thread Yves Arrouye

 This is incorrect. Here is a summary of the meaning of those bytes at
 the start of text files with different Unicode encoding forms.
 
 beginning with bytes FE FF:
 - UTF-16 = big endian, omitted from contents
 
 beginning with bytes FF FE:
 - UTF-16 = little endian, omitted from contents

Unfortunately this breaks with popular Unicode libraries like ICU (I am
Cc:ing them here, since I have the opportunity to raise this again), where
UTF-16 is mapped to the platform endian form:

(From ICU's convrtrs.txt file:)

# The ICU UTF-16 converter uses the current platform's endianness.
# It does not autodetect endianness from a BOM.
UTF-16 { MIME }  UTF16_PlatformEndian ISO-10646-UCS-2 { IANA }
csUnicode ibm-17584 ibm-13488 ibm-1200 cp1200 ucs-2

(End of excerpt.)

This is typically *very* confusing to new users of Unicode. I wish such
libraries used only a UTF-16PE denomination for such a converter, and
handled UTF-16 as a converter per the expectations that Mark described well
in his explanation of how to interpret a FF FE / FE FF sequence of bytes.
Otherwise you end up having people properly label "UTF-16" some UTF-16 with
a BOM, and naive code using the library's UTF-16 converter (sounds
appropriate, right?) failing to decode data properly.

In the context of ICU, it's one of my favorite pet peeves, especially since
ICU is usually so a*al about being very strict as far as the
interpretation of a given charset name goes. The last time I read the
Unicode standard, UTF-16 was big endian unless a BOM was present, and that's
what I expected from a UTF-16 converter.

YA





RE: Collation - last character?

2002-03-22 Thread Yves Arrouye

 TUS does not prevent anyone from putting noncharacter code points in
 Unicode strings. As a matter of fact, p. 23 of TUS 3.0 reads "U+FFFF is
 reserved for private program use as a sentinel or other signal". I would
 expect this to hold true for the noncharacters that were introduced later
 too. It may not fit your needs if you're looking for a character, but it
 is available for use by applications.
 
 But it is *not* available to *users* to put into lists to make certain
 elements sort at the end.

When dealing with user-specified lists, I would if possible introduce some
markup so that my application can deal with those two special cases
(lowest/highest) as it wishes internally, without burdening the user with the
need to enter an improbable (in her everyday context) codepoint.
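
A sketch of what I mean by markup (the structure is hypothetical, and
strcmp stands in for a real collator):

#include <string.h>

/* The lowest/highest placement travels as explicit markup, not as an
   improbable code point buried in the key itself. */
enum placement { SORT_FIRST = -1, SORT_NORMAL = 0, SORT_LAST = 1 };

struct entry {
    enum placement place; /* from the markup */
    const char *key;      /* the user-visible string */
};

static int compare_entries(const struct entry *a, const struct entry *b)
{
    if (a->place != b->place)
        return a->place < b->place ? -1 : 1; /* markup wins */
    return strcmp(a->key, b->key);           /* stand-in for a collator */
}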

YA





RE: Collation - last character?

2002-03-19 Thread Yves Arrouye

 Markus Scherer wrote:
  How about U+10FFFF?
  It is a non-character, which gives it a high (unassigned
  character) weight in the UCA. It is the highest code point =
  the last character.
 
 That is definitely not what I was looking for. It is an illegal codepoint,
 while I was looking for a legal codepoint, and one that would not 'happen
 to be' the last, but would be 'defined as' last.

TUS does not prevent anyone from putting noncharacter code points in Unicode
strings. As a matter of fact, p. 23 of TUS 3.0 reads "U+FFFF is reserved for
private program use as a sentinel or other signal". I would expect this to
hold true for the noncharacters that were introduced later too. It may not
fit your needs if you're looking for a character, but it is available for
use by applications.
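
For instance, a minimal in-process use (a sketch; the sentinel never
leaves the program):

#include <stddef.h>
#include <stdint.h>

/* Count the UTF-16 code units in a buffer terminated by U+FFFF, used
   purely as an internal sentinel, never interchanged. */
static size_t length_to_sentinel(const uint16_t *s)
{
    size_t n = 0;
    while (s[n] != 0xFFFF)
        n++;
    return n;
}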

YA





RE: Standard Conventions and euro

2002-03-02 Thread Yves Arrouye

 The old currencies on the continent (German Mark, Dutch guilder, French
 franc) however use a period to divide the groups and a comma as a decimal
 sign.
 
 Some use a full stop as the thousands separator and some use a
 numeric (nonbreaking) space. Switzerland uses an apostrophe for the
 thousands separator, I believe.

Yes, Switzerland uses an apostrophe.

France does use a comma for the decimal separator, but uses a non-breaking,
non-expansible (constant size) space to group digits 3 by 3, and not a dot:

1 799 237,59

Check your Palm if you have one. Last time I looked, their number formats
were okay.
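
A toy formatter for that convention (mine; a plain space stands in for the
non-breaking, non-expansible one, and a real program would go through the
locale machinery):

#include <stdio.h>

static void format_fr(long whole, int cents, char *out, size_t outlen)
{
    char digits[32];
    int n = snprintf(digits, sizeof digits, "%ld", whole);
    size_t o = 0;
    int i;
    for (i = 0; i < n; i++) {
        if (i > 0 && (n - i) % 3 == 0 && o + 1 < outlen)
            out[o++] = ' ';       /* group separator, 3 by 3 */
        if (o + 1 < outlen)
            out[o++] = digits[i];
    }
    snprintf(out + o, outlen - o, ",%02d", cents); /* decimal comma */
}

/* format_fr(1799237, 59, buf, sizeof buf) gives "1 799 237,59". */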

YA





RE: Standard Conventions and euro

2002-03-01 Thread Yves Arrouye

  listing the way I wanted it.  *nix systems that start with fr_FR and
  then allow you to define fr_FR-EURO or something really aren't much
  better; what if I want to deviate from the pre-defined locale in four or
  five ways instead of just one?

They do not let you deviate from a pre-defined locale in one way. They have
two pre-defined locales whose names are fr_FR and fr_FR-EURO (fr_FR@EURO),
and you can simply select one or the other. Anybody's free to write an
fr_FR@MYTASTE locale that customizes fr_FR, and use that.
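
Something like this, say (a hypothetical, abbreviated glibc-style locale
source; syntax from memory, so take it as a sketch):

comment_char %
% fr_FR@MYTASTE: reuse fr_FR wholesale, override only what you care about.
% Compiled with something like:
%   localedef -i fr_FR@MYTASTE -f ISO-8859-1 fr_FR@MYTASTE

LC_NUMERIC
decimal_point   "<U002C>"
thousands_sep   "<U00A0>"
grouping        3;3
END LC_NUMERIC

LC_CTYPE
copy "fr_FR"
END LC_CTYPE
% ... and so on, copy "fr_FR" for each remaining category.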

YA





RE: Standard Conventions and euro

2002-03-01 Thread Yves Arrouye


 On Fri, 1 Mar 2002 11:26:42 +0100, Marco Cimarosti wrote:
  French francs amounts were often
  written with a single decimal (because the smallest coin was 10 cents)
 
 No, the 5 centime coin remained in use (until the recent demise of the
 Franc, of course) and in any case it was very rare to see amounts
 written (or displayed) with anything other than 2 decimals.

And we even had some 3.99 FF prices, even though we couldn't pay them in
cash. What happened is that the store would sum up everything you buy and
then round down to whatever could be paid in hard currency. A good deal: two
3.99 FF items would set you back 7.95 FF, versus 7.90 FF if they had been
priced at 3.95 FF to start with. Multiply that by millions of sales
monthly...

YA





RE: Unicode page Web ring?

2002-03-01 Thread Yves Arrouye

 My page is in Unicode, but does not mention Unicode except in the headers,
 and the headers are invisible unless you choose "view source" in your
 browser.

My company's service has been in UTF-8 since I joined in 1998. See
http://www.realnames.com/. Another good example, but it's much more recent:
http://www.msn.com/.

YA





RE: ISO 3166 (country codes) Maintenance Agency Web pages move

2002-02-28 Thread Yves Arrouye

 I'm confused.  Do you mean meaningless identifiers? They look
 meaningless to me.  House numbers in North America (and in France
 also, it seems) have a few bits of meaning: the least-significant
 (numeric) bit tells you which side of the street the house is on,
 and it's often the case that you can deduce the cross street
 from the house number.  Similarly with the others.

Until, that is, when some smart beep decides to renumber everybody by
counting the distance in meters from the start of the street to the house.
We've got a house that went from an odd to an even number this way. Not to
mention people wondering why the neighbor of 650 SomeStreet was at 615...

YA





RE: Standard Conventions and euro

2002-02-28 Thread Yves Arrouye

 Perhaps not as physical currency, but they sure do still exist in data,
 and will continue to exist in data until the Apocalypse.
 
 When is that scheduled to occur?
 
 [Alain]  Very simple: « la semaine des quatre jeudis » (the week of the 4
 Thursdays, as we say in French).

And the exact day would be that of St Glinglin. (Still a French reference.)
YA





RE: Unicode and end users

2002-02-16 Thread Yves Arrouye

 If "foo" is a US-ASCII string, grep foo file will work fine with any
 US-ASCII-superset charset for which non-ASCII characters do not use
 bytes < 0x80, including the hypothetical one I described, with no
 possibility of a false match. However grep fóó file will work only
 if the current shell charset (i.e. of argv[1]) matches the encoding of
 file.

Not necessarily. It will work as long as the sequence of 3 bytes fóó is the
representation of the string you are looking for in the file, in that file's
encoding. grep does not validate anything, nor should it IMHO. If you want
to guarantee the encoding, use a converter like ICU's uconv(1) or iconv(1).
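
For instance, to search a Latin-1 file from a UTF-8 shell, convert the
pattern's bytes first; a rough sketch with iconv(3), error handling elided:

#include <iconv.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char in[] = "f\xc3\xb3\xc3\xb3"; /* "fóó" in UTF-8 */
    char out[64];
    char *ip = in, *op = out;
    size_t il = strlen(in), ol = sizeof out;
    iconv_t cd = iconv_open("ISO-8859-1", "UTF-8");
    iconv(cd, &ip, &il, &op, &ol);
    iconv_close(cd);
    fwrite(out, 1, sizeof out - ol, stdout); /* feed this to grep */
    return 0;
}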

YA





RE: This spoofing and security thread

2002-02-14 Thread Yves Arrouye

 The very fact that most of them can be reduced to ASCII and people still
 find the resulting text useful and accurate to the original is a sign
 that the important characters in English are in ASCII. And all the
 standard transliterations - em-dashes -> --, c-cedilla -> c, e-acute,
 e-grave -> e, o-umlaut -> o, shaped quotes -> " and ' - are from
 characters in Windows-1252.

Well, wouldn't you expect an American standard to properly encode the
important characters for English? I would. Only ISO has the luxury of
encoding Western European languages without catering properly to French and
some Nordic language (sorry, forgot which; as for French, I am referring to
the lack of oe ligature in iso-8859-1).

YA





RE: Unicode and end users

2002-02-14 Thread Yves Arrouye

 UTF-8 should *never* contain the BOM.

But as has been pointed out, it is common practice for Microsoft, and also
for ICU's genrb tool, for example, which uses the BOM to autodetect the
encoding. The more examples you see of that, the more people will use the
BOM (now, can't we all use -*- coding: utf-8 -*- ;-)?).
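
The autodetection in question is tiny anyway; a sketch:

#include <stddef.h>

/* A UTF-8 signature is just the three bytes EF BB BF up front. */
static int has_utf8_bom(const unsigned char *b, size_t len)
{
    return len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF;
}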

YA





RE: This spoofing and security thread

2002-02-13 Thread Yves Arrouye

 What do you mean? I've done work for Project Gutenberg, and looked at a
 number of books with thoughts of reducing them to ASCII. In my opinion,
 Windows-1252 has every character that most English books will need,

Especially those books that you want to reduce to ASCII :-)
YA





RE: UTF-16 is not Unicode

2002-02-12 Thread Yves Arrouye

 An ideal interface should probably automatically and silently select
 Unicode (and its default UTF) whenever one or more of the characters in a
 document are not representable in the local encoding.

I beg to differ. Silently doing such an unexpected change is guaranteed to
confuse the user, especially as she starts exchanging the files or loading
them in other programs. The interface should warn the user and offer a couple
of sensible choices, one of them (and maybe the default) being to save using
one of the UTFs.

YA





RE: Unicode and Security: Domain Names

2002-02-08 Thread Yves Arrouye

Moreover, the IDN WG documents are in final call, so if you have comments to
make on them, now is the time. Visit http://www.i-d-n.net/ and sub-scribe
(with a hyphen here so that listar does not interpret my post as a command!)
to their mailing list (and read their archives) before doing so.

The documents in last call are:

1. Internationalizing Domain Names in Applications (IDNA)
http://www.ietf.org/internet-drafts/draft-ietf-idn-idna-06.txt

2. Stringprep Profile for Internationalized Host Names
http://www.ietf.org/internet-drafts/draft-ietf-idn-nameprep-07.txt

3. Punycode version 0.3.3
http://www.ietf.org/internet-drafts/draft-ietf-idn-punycode-00.txt

4. Preparation of Internationalized Strings (stringprep)
http://www.ietf.org/internet-drafts/draft-hoffman-stringprep-00.txt

and the last call will end on Feb 11th 2002, 23h59m GMT-5. There is little
time left.

YA










RE: Unicode and Security: Domain Names

2002-02-08 Thread Yves Arrouye

 Are the actual domain names as stored in the DB going to be canonical
 normalized Unicode strings? It seems this would go a long way towards
 preventing spoofing ... 

Names will be stored according to a normalization called Nameprep. Read the
Stringprep (general framework) and Nameprep (IDN application, or Stringprep
profile) for details. This normalization includes a step of normalizing
using NFKC, but it does more than that.

 no one would be allowed to register a non-canonical normalized domain
 name. Then, a resolver would be required to normalize any request string
 before the actual resolve.

To keep resolvers' load the same as today, client applications will do
the normalization of their requests. If they don't normalize properly, the
lookup will just fail. Read the IDNA document for more info on this.

All normalized strings are encoded in a so-called ASCII Compatible Encoding
which uses the restricted set of characters used in the DNS today (letters,
digits, hyphen except at the extremities) for host names (which are
different than STD13 names, cf. SRV RRs for example). Read IDNA, again, and
Punycode, the chosen encoding.

YA





RE: Unicode and Security

2002-02-06 Thread Yves Arrouye

 Well, nothing wrong with Unicode of course. Just means that there will
 need to be an option in your browser to reject any site without a digital
 certificate, and perhaps it will need to be turned on by default. So,

Nothing prevents sites running frauds from getting a certificate matching
their name. If the price of certificates drops, or if the fraud has good
enough margins, it will not even be a big inconvenience.

YA





RE: ICU's uconv vs Linux iconv and UTF-8

2002-02-01 Thread Yves Arrouye

 As part of the mystery of CJK encodings I notice that IBM's ICU's
 uconv and SuSE 6.4 Linux iconv differ as to the UTF-8 representation
 of table.euc.

 Both converters will round-trip with themselves and give a byte-exact
 copy of table.euc.

 Weirdly they differ in how they map '\' and '~' in ASCII space as 
 well as some spots in higher characters.

That is understandable if they use different tables. The question is which
one is the right EUC-JP, and which one do users want? ICU, as well as
iconv, could have two tables with the different mappings. The question then
is how to label them, and whether the labeling should be compatible between
the two.
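
For what it's worth, my guess (an assumption on my part, not a diagnosis of
Nick's tables) is the classic single-byte question of ASCII versus JIS X
0201 Roman, which differ in exactly two spots:

/* The two plausible single-byte readings of the same two bytes: */
static const struct {
    unsigned char byte;
    unsigned ascii;    /* if the single-byte range is ASCII */
    unsigned jisx0201; /* if it is JIS X 0201 Roman         */
} diffs[] = {
    { 0x5C, 0x005C /* REVERSE SOLIDUS */, 0x00A5 /* YEN SIGN */ },
    { 0x7E, 0x007E /* TILDE */,           0x203E /* OVERLINE */ },
};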

 Linux iconv will not take ICU's UTF-8.
 ICU's uconv will read the iconv output but does produce same as
 original
 table.euc.

I find the "same as original" statement confusing. Are you saying that
uconv's UTF-8 is ill-formed? Nick, would you mind emailing me (and just me,
not the list) your table.euc sample file?

Thanks,
YA







RE: ICU's uconv vs Linux iconv and UTF-8

2002-02-01 Thread Yves Arrouye

 It is definitely a problem to try to interpret what any given label is
 supposed to be. The problem is that MIME labels and others are
 ambiguous, and are interpreted different ways on different systems.

Still, in the meantime it does make sense to have EUC-JP associated with the
most common interpretation of it, doesn't it? Just for the sake of user
satisfaction?

I am curious: is there a better name for the EUC-JP that ICU is using,
one that would make everybody understand which one it is? If so, we could
have "EUC-JP" for the one that the rest of the world wants.

YA





RE: Introducing the idea of a ROMAN VARIANT SELECTOR (was: Re: Proposing Fraktur)

2002-01-31 Thread Yves Arrouye

 quite a lot of space. However, Fraktur is already encoded in the
 Mathematical whatever-it's-called block. This variant selector would mean
 that lots of characters can be displayed in two *different* ways. I'd
 prefer
 that Fraktur diacritics were added instead, and that the mathematical
 letters were to be used for Fraktur texts.

I hope not. These were encoded there because they convey a specific meaning
when used for mathematics. If you use them to spell out names, then you're
abusing them and potentially confusing software that would rely on their
mathematical semantics.

I think it's time to have another proposal for French, FRENCH VARIANT
SELECTOR, where we do not use Fraktur but some other font variation. And we
may need a QUEBEC VARIANT SELECTOR if they have different rules... Or should
it be a QUEBEC FRENCH VARIANT SELECTOR to show the relationship?

YA





RE: POSITIVELY MUST READ! Bytext is here!

2002-01-28 Thread Yves Arrouye

 Well, I've seen cases where chat engines have
 converted ASCII into emoticon pictures at the wrong
 places...

And sometimes you can't turn them off. Grumble. I couldn't give out sample
code in MSIM using foo(c) for a function call w/o getting a cup of coffee
after foo!

YA





RE: [Very-OT] Re: ü

2002-01-23 Thread Yves Arrouye

 Obviously (I advocate in French changing the spelling of common foreign
 words so that there would be more consistency).
 
 Le ouiquende?
 
 That would be pronounced wikãd... To respect the English pronunciation
 you would have to write it ouiquennde, which would still be a very odd
 spelling in French... The end sound is really not French in itself...

France's Académie française is good at that: they recently invented cédérom
(CD-ROM; gets used because it's quite okay), and mèl (mail, for e-mail;
nobody uses it except to make fun of it).

YA





RE: RE: [Very-OT] Re: ü

2002-01-23 Thread Yves Arrouye

 http://www.culture.fr/culture/dglf/dispositif-enrichissement.htm

Thanks for the pointer. Though I can't find the exact sentence re: the
substantive use, I found "mél" referred to as a symbol for "messagerie
électronique". I like "courriel" a lot. Nice.

YA





RE: Funky characters, Japanese, and Unicode

2002-01-18 Thread Yves Arrouye

 1. I have a Geocities page now. I do not know what encoding Geocities
 uses, but I think it's unicode. What I did for the Japanese text on it was
 not think about encodings and just type it in with Microsoft's IME (and do
 some swearing at the IME in the process). And it comes out fine, for the
 most part. Why does this work? What encoding does it use?

Your browser (which one?) just does a good job of detecting the encoding
used for your page http://www.geocities.com/elevendigitboy/. For instance,
if I view it with IE after unselecting the Autoselect item of the
View/Encoding menu, I get garbage as expected. Otherwise, IE does recognize
Shift_JIS.

YA
--
Sailing is harder than flying. It's amazing that man learned how to sail
first. -- Burt Rutan.




RE: Off topic: Whut in tarnation is Unicode?

2002-01-16 Thread Yves Arrouye

Re: elite-speak generator, I meant the one Edward Cherlin posted:

L33t-5p34k, d00d! 1t'5 3v3rywh3r3. Try the L33t-5p34K Generator!!!### at 
http://www.geocities.com/mnstr_2000/translate.html 

but the link to the trusty mail archives was enough :) Thanks.

YA
--
Sailing is harder than flying. It's amazing that man learned how to sail
first. -- Burt Rutan.





RE: Off topic: Whut in tarnation is Unicode?

2002-01-15 Thread Yves Arrouye

Now if someone could resend this elite-speak converter link, it was great.
Please...

Thanks!
YA
--
Sailing is harder than flying. It's amazing that man learned how to sail
first. -- Burt Rutan.





RE: C with bar for with

2001-12-02 Thread Yves Arrouye

It may even be a glyph variant of the w with forward slash...
YA

 -Original Message-
 From: Stefan Persson [mailto:[EMAIL PROTECTED]]
 Sent: Sunday, December 02, 2001 3:19 AM
 To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
 Subject: Re: C with bar for with
 
 - Original Message -
 From: [EMAIL PROTECTED]
 To: [EMAIL PROTECTED]
 Sent: den 2 december 2001 02:16
 Subject: C with bar for with
 
 
  Someone said that in English, c-with-underbar means "with". My mom
  writes this as c-with-overline.
 
 Well, then I suppose this is a glyph variant of the c with underbar...
 
 Stefan
 
 
 _
 Do You Yahoo!?
 Get your free @yahoo.com address at http://mail.yahoo.com





RE: Character encoding at the prompt

2001-10-25 Thread Yves Arrouye

 But:
 setenv LC_ALL en_US.UTF-8
 env LC_ALL=it date
 giovedì, 25 ottobre 2001, 11:45:24 EDT
 
 I could not understand why I get the display of the letter ì in the
 en_US.UTF-8 Locale. My understanding was that the date command was
 generating the message in the Italian locale (default encoding iso-8859-1)
 and as a result ì would be encoded as 0xEC. The display should be done in
 the en_US.UTF-8 Locale and be an invalid byte sequence.

I think you're making an improper assumption, namely that your
*terminal* is in UTF-8 and would then complain. Unless your terminal has
explicit support for UTF-8 I do not think it will validate things. And it
apparently has not been started from a process that was already using UTF-8,
since you're issuing your setenv LC_ALL en_US.UTF-8 at the prompt. This is
only affecting subsequent commands (unless overridden, of course, as in your
next call), not the running process.
 
YA

PS: not to mention zsh(1) would be a better shell ;-) just teasing





RE: normalize before map?

2001-10-04 Thread Yves Arrouye

[People were discussing whether one should do some case mappings before
doing normalization, or the other way, and whether the case mapping can be
naive or must account for what normalization will do/has done in order not
to break assumptions that the resulting string is both case-folded and
normalized. The normalization form used can be anything I believe, though in
the IETF context NFKC and NFC are the common ones.]

  My guess is that case folding by no means guarantees that the output is
  still normalized.
 
 Right, if you fold and then normalize, your string might not be properly
 folded anymore (which is why nameprep had to adjust the mapping table).
 Similarly, if you normalize and then fold, your string might not be
 properly normalized anymore.  Either way, if you want a string to be
 both normalized and folded, you cannot naively apply normalization and
 case-folding (in either order), you need to tweak the mapping table to
 compensate for the interactions.  The sentence quoted from UTR#21 above
 glosses over this problem.  The problem exists (and has a solution) no
 matter which order you use.
 
 Does Mark Davis (the author of UTR#21) subscribe to this list?  It would
 probably be helpful to get his thoughts on the matter.

You can always Cc: [EMAIL PROTECTED] for such questions. Which I am doing
now. Of course, we don't want to Cc: them all the time...
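
For what it's worth, a brute-force alternative to tweaking the table (a
sketch of mine, assuming the process does reach a fixpoint, which should be
verified for the chosen forms) is to iterate until the string stops
changing:

#include <string.h>

typedef void (*xform)(char *s, size_t cap); /* stand-ins for real
                                               normalize/fold routines */

static void stabilize(char *s, size_t cap, xform normalize, xform fold)
{
    char prev[1024];
    do {
        strncpy(prev, s, sizeof prev - 1);
        prev[sizeof prev - 1] = '\0';
        normalize(s, cap);
        fold(s, cap);
    } while (strcmp(prev, s) != 0); /* stop once both properties hold */
}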

YA




RE: Currency symbols (was RE: Shape of the US Dollar Sign)

2001-10-01 Thread Yves Arrouye


 About ₤ (L with two bars = Italian lira or Egypt/Cyprus pound) and £
 (L with one bar = Pound Sterling or Irish punt), I think that the
 Unicode distinction is not valid because:
 
 [...]

 For these reason, I suggest that font designers ignore the distinction
 between U+00A3 (POUND SIGN) and U+20A4 (LIRA SIGN) and use the same glyph
 for both.  The glyphs should have one or two bars depending on the font
 style and on the choice made for other currency symbols.

Interesting comment. Isn't the Unicode distinction simply one of characters,
and the difference in glyphs shown in the standard simply a reflection of
the preferences of the designer of the fonts used to print the character
tables? I'd think so.

YA





RE: DerivedAge.txt

2001-09-26 Thread Yves Arrouye

 At the request of someone working with ICU, I regenerated a derived file
 that shows the age of Unicode characters -- when they came into Unicode.
 Does anyone think this might be useful to have in the UCD?

It is definitely useful information that could go into UNIDATA. Here is a
good use for it (and my reason for asking Mark to regenerate it for me):
when one uses a library such as ICU that manipulates 3.1 data but wants to
store some data in a database that won't like anything after 2.x. Using
this, one can validate data before sending them to the database as needed.

It doesn't necessarily have to get into the UCD, except if it helps me make
a smaller change to ICU to support the version as a character property ;-)

YA





RE: 3rd-party cross-platform UTF-8 support

2001-09-21 Thread Yves Arrouye

  UTF-16 <-> wchar_t*
 
 Wait, be careful here. wchar_t is not an encoding. So... in
 theory, you cannot convert between UTF-16 and wchar_t. You,
 however, can convert between UTF-16 and wchar_t* ON Win32
 since Microsoft declares UTF-16 as the encoding for wchar_t.

And he can also do the same between UTF-16 and UTF-32 for glibc-based
programs, since UTF-32 is the encoding for wchar_t on such platforms.

The way I read that was UTF-16 <-> UTF-(8*sizeof(wchar_t)). (Please don't
ask what happens when sizeof(wchar_t) is 3 or larger than 4, you know what I
mean :)). I guess the responsibility of this being a meaningful conversion
would be with the caller.
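
In code, that reading is just a width dispatch (a sketch; whether the names
are actually right is entirely the caller's assumption):

#include <stddef.h>

static const char *wchar_utf_name(void)
{
    switch (sizeof(wchar_t)) {
    case 2:  return "UTF-16"; /* e.g. Win32 */
    case 4:  return "UTF-32"; /* e.g. glibc platforms */
    default: return NULL;     /* the "please don't ask" case */
    }
}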

YA

PS: I don't know a way of knowing the encoding of wchar_t programmatically.
Is there one? That'd offer some interesting possibilities..




RE: UTF-8 on NT

2001-09-04 Thread Yves Arrouye



 I'm also thinking of 3rd party UTF-8 support such as libutf8, IBM ICU.
 They seem no good supports on NT, what do you think?

We are using ICU for all our Unicode needs, on NT, Windows 2000, and Unix,
and it works perfectly well on all of these.
YA



How are the UNIDATA derived files generated?

2001-08-29 Thread Yves Arrouye

Hi,

I would like to know how the derived files that one can find in the UNIDATA
folder are generated. I am trying to have IBM's ICU library support older
versions of Unicode than the one it currently supports (3.0.something),
specifically Unicode 2.1.x.

ICU needs the following files:

UnicodeData.txt
SpecialCasing.txt
DerivedNormalizationProperties.txt
NormalizationTest.txt
UCARules.txt
FractionalUCA.txt
CaseFolding.txt
Mirror.txt

If I look in Public/2.1-Update4 I can find the first two files for Unicode
2.1.9.

A number of the other files either say they have been algorithmically
generated (e.g. DerivedNormalizationProperties.txt) or look like they have.
I am interested in knowing what tools have been used to generate these and
if I could get these tools and use them to generate the same files for
another version of Unicode. I am sure I could write some tools myself
(following the instructions in DerivedProperties.html for
DerivedNormalizationProperties.txt for example) but I am looking for a
quicker way to generate these.

Thanks for any help on this,
YA

PS: Also I hope that all the derived files will be stored in the non-UNIDATA
folders as Unicode is revised. They'll be helpful for people that need to
build a Unicode library for a very specific version of Unicode.
--
My opinions do not necessarily reflect my company's.
The opposite is also true.




RE: Locale codes (WAS: RE: RTF language codes)

2001-07-27 Thread Yves Arrouye

 On Thu, Jul 26, 2001 at 01:04:29AM -0700, Yves Arrouye wrote:
   If you have a cross platform system you should use RFC 1766 
   style locales
   between systems and convert them to LCIDs on Windows.
  
  RFC 3066 was published in January. Check it out.
  http://www.ietf.org/rfc/rfc3066.txt
 
 Note that neither RFC 1766 nor RFC 3066 refer to locales;
 they just define language identification tags.

Yes, I should have made that correction in my reply. These tags, or some
variations of these tags (e.g. replacing the hyphen by an underscore), can be
found as locale identifiers in many systems; I think that's what Carl was
referring to (e.g. use en_US).

I am not sure, and don't think, that the use of en_US on Unix/POSIX is
actually related to these RFCs. Does anybody know for sure?

YA




Locale codes (WAS: RE: RTF language codes)

2001-07-26 Thread Yves Arrouye

 If you have a cross platform system you should use RFC 1766 
 style locales
 between systems and convert them to LCIDs on Windows.

RFC 3066 was published in January. Check it out.
http://www.ietf.org/rfc/rfc3066.txt

YA




RE: Ethnologue 14 online

2001-07-24 Thread Yves Arrouye

 After considerable and unfortunate delay, the new Ethnologue site,
 including the online version of the 14th Edition, is at last 
 available to
 the public: http://www.ethnologue.com/home.asp. There are 
 still refinements
 being made, but all the basics are there and working.

Very nice! Something to get lost into for hours...
YA




RE: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-21 Thread Yves Arrouye

   SCSU doesn't look very nice for me. The idea is OK but it's just
   too complicated. Various proposals of encodings differences or xors
   between consecutive characters are IMHO technically better: much
   simpler to implement and work as well.
 
 These differential schemes seem to be the way IDN (internationalized
 domain names) are headed.  They are intended for the limited scope of
 domain names that have already passed through nameprep, which performs
 normalization and further limits the range of allowable characters.  I'm
 not sure how well the ACEs would perform with arbitrary Unicode text.  I
 suppose only testing would answer that question.

Also don't forget they're likely to add some code point reordering. Do we
want that too in an alternate scheme? Then is it really that much simpler
than SCSU? (Probably; tables for code point reordering are not complex to
build. But they may take some effort to optimize, so my guess is the
implementation effort may be roughly the same.)

YA




RE: More about SCSU (was: Re: A UTF-8 based News Service)

2001-07-13 Thread Yves Arrouye

 SCSU is also registered as an IANA charset, although you are unlikely to
 find raw SCSU text on the Internet, due to its use of control characters
 (bytes below 0x20).

And what browser supports SCSU, and what is that browser's reach in terms of
population? Because that's usually what matters to people that publish on
the Internet.

YA
 




RE: Playing with Unicode (was: Re: UTF-17)

2001-06-25 Thread Yves Arrouye

 A proposal needs a definition, though:
 
  UTF would mean Unicode Transformation Format
  utf would mean Unicode Terrible Farce
 
 untenable total figment?

unable to focus?
utf twisted form?
 
YA




RE: UTF-17

2001-06-25 Thread Yves Arrouye


 From: [EMAIL PROTECTED]
 
  Oh yeah, well, I can be more tongue-in-cheek than all of you.  I've
 already
  implemented it.

Quick, quick. Patent it and then open-source it. It will be unstoppable.
YA




RE: UTF-17

2001-06-22 Thread Yves Arrouye

Isn't UTF-17 just a sarcastic comment on all of this UTF- discussion?
YA
 




RE: converting ISO 8859-1 character set text to ASCII (128)charactet set

2001-06-20 Thread Yves Arrouye

 We have a specific requirement of converting Latin-1 character set
 (iso 8859-1) text to ASCII character set (a set of only 128 characters).
 Is there any special set of utilities available or service providers who
 can do that type of job?

[I am assuming that your ascii table is the ASCII everybody uses, not some
variation of it.]

If you do not care about the loss of information at all, just truncate the
data to 7 bits. You can write a trivially simple program for that, or use
your platform's conversion tools or routines (cf. iconv(1) and iconv(3) on
UNIX 98 platforms, uconv from ICU's contributed applications at
http://oss.software.ibm.com/icu/, or the WIN32 conversion APIs whose name I
forgot).

If you want to minimize the loss, you may want to use fallbacks so that for
example you will lose diacritics on letters but will retain the base letter,
giving you things like "mon bebe a tete tout l'ete" for French. I am sure
the WIN32 APIs will let you do that, iconv doesn't support it, and I am not
sure about whether the ICU ASCII converter has fallbacks (some of their
converters do, some don't; but this may be outdated info).
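
A minimal hand-rolled sketch of that kind of fallback (mine; only a few
table entries shown, a full table covers the whole upper half):

#include <stdio.h>

static const char *latin1_fallback(unsigned char c)
{
    switch (c) {
    case 0xE8: case 0xE9: case 0xEA: case 0xEB: return "e"; /* è é ê ë */
    case 0xE0: case 0xE1: case 0xE2: return "a";            /* à á â   */
    case 0xE7: return "c";                                  /* ç       */
    default:   return c < 0x80 ? NULL : "?";                /* unknown */
    }
}

int main(void)
{
    const unsigned char in[] =
        "mon b\xe9" "b\xe9" " a t\xe9" "t\xe9" " tout l'\xe9" "t\xe9";
    const unsigned char *p;
    for (p = in; *p; p++) {
        const char *f = latin1_fallback(*p);
        if (f) fputs(f, stdout); else putchar(*p);
    }
    putchar('\n'); /* mon bebe a tete tout l'ete */
    return 0;
}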

Hope this helps,
YA




RE: UTFs, ACEs, and English horns

2001-06-18 Thread Yves Arrouye

 Also check out the sites of the IETF IDN WG
 (http://www.ietf.org/html.charters/idn-charter.html, and
 http://www.i-d-n.net/) for more information than you may have wished for.

Oops. Sorry, I only saw James's answer. You obviously read these. Well, I
hope my English horns pages were new reading at least...

YA




RE: UTFs, ACEs, and English horns

2001-06-18 Thread Yves Arrouye

Also check out the sites of the IETF IDN WG
(http://www.ietf.org/html.charters/idn-charter.html, and
http://www.i-d-n.net/) for more information than you may have wished for.
Except on English horns, that is; but then you may want to visit
http://www.users.globalnet.co.uk/~gbrowne/geoff9.htm and
http://www.mathcs.duq.edu/~iben/oboeng.htm :).

Good luck,
YA




RE: Missing characters for Italian

2001-06-11 Thread Yves Arrouye

 So my question is: is the superscript attribute essential in French to
 understand these abbreviations (as it is in Italian), or is it desirable
 but optional (as it is in English)?

Not to understand them. While understanding is subjective, it is usually
evident from the context that these are abbreviations, and which ones they
are. I wouldn't wish them to be encoded as characters myself. Displaying
them properly is what typography is for.

YA




RE: Term Asian is not used properly on Computers and NET

2001-06-03 Thread Yves Arrouye

 There are also terms like the West or Western (world, languages,
civilization, etc) which have referents that are not completely west of
the Greenwich Meridian, whose usage cannot be simply explained or
justified by it.

Every point can be found west (or east) of the Greenwich Meridian. Not all
of them have west or east longitudes, though.

YA




RE: Metafont [was Re: Single Unicode Font]

2001-05-26 Thread Yves Arrouye

 BTW, it seems that Metafont is a trademark of Addison Wesley 
 publishing
 company ...

Interesting. Maybe because they published the Metafont book (and its friend
Metafont: the program) along with the rest of Knuth's Computers and
Typesetting books? This is the bell that Metafont (as you capitalized it)
rings for me. See http://www.math.utah.edu/~beebe/fonts/metafont.html and
http://cgm.cs.mcgill.ca/~luc/metafont.html.

YA




RE: search ignoring diacritics

2001-05-21 Thread Yves Arrouye

 Peter> normalise both data and search string - delete / ignore all
 Peter> characters with general category Mn

It worked well for us too. Someone mentioned to me once though that U+3099
and U+309A should be preserved in order not to change the meaning of words,
and we do so. But maybe this is not necessary?

YA




RE: About Kana folding

2001-05-18 Thread Yves Arrouye

Kenneth,

Thanks for the explanations.

 So I'd suggest you be very careful when trying to do this kind of
 a folding. If it is just for surface text matching, the number of
 false positive matches would likely swamp the number of false
 negatives you'd be correcting.
 
 On the other hand, if you are doing a phonetic matching, then of course
 you have to fold the Hiragana and Katakana forms together.

I am trying to work around a situation where people cannot register a
database key in Katakana and the same one in Hiragana (because the DB's
collation does some Kana folding), yet they need to be able to find it using
either of these (after this key has been migrated to some other system that
doesn't do Kana folding). I don't know if that's what you call surface text
matching. The matching will be done on the whole key, not using N-grams.
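
For the record, the brute surface fold over the part where the two blocks
line up is nearly a one-liner (a sketch; it deliberately leaves U+30FC and
the Katakana-only letters alone, which is where the caveats bite):

#include <stdint.h>

/* U+30A1..U+30F4 sit exactly 0x60 above U+3041..U+3094. */
static uint32_t fold_kana(uint32_t c)
{
    if (c >= 0x30A1 && c <= 0x30F4)
        return c - 0x60; /* Katakana letter -> Hiragana letter */
    return c;
}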

 The more serious problem of equivalencing for matching in Japanese
 would be kanji versus Hiragana, in particular. [...] Getting this kind of
 thing right is far more important for matching in Japanese than just
 brute matching of Hiragana to Katakana.

And if one wanted to do that automatically (which is not my intent, Kanji
work fine), one would need a dictionary to go from words in Kanji to one
Kana, is that true?

YA




About Kana folding

2001-05-17 Thread Yves Arrouye

Hi,

If one were to need to pick Katakana versus Hiragana and fold one into the
other (say to let people match a word or sentence in any of them), is there
one that is preferable to the other? I think that some Katakana have no
Hiragana equivalents, does that mean that it's always easier to go from
Hiragana to Katakana? Also, what are the caveats of doing such foldings (and
is it possible to change meanings?)

Thanks!
YA
--
My opinions do not necessarily reflect my company's.
The opposite is also true..




RE: Help in a HURRY !!!!!!!!!!!!!!!!!!!!!!!

2001-05-15 Thread Yves Arrouye

To go with Lukas's Perl code, I'll provide a C version, not really tested
either, with ICU, to give him a choice. No error checking etc., just to give
the idea. If you want UTF-16 you'll need to use the macros in
unicode/utf16.h to generate surrogate pairs properly.

#include <stdio.h>
#include <string.h>
#include <unicode/utf8.h>

#define LINE_MAX 80 /* Whatever. */

int main() {
    char buf[LINE_MAX];

    while (fgets(buf, sizeof(buf), stdin)) {
        int32_t i;
        size_t len = strlen(buf);

        if (len > 0 && buf[len - 1] == '\n') {
            buf[--len] = 0;   /* We don't want that one in the output. */
        }

        for (i = 0; i < (int32_t)len;) {
            int32_t c;

            UTF8_NEXT_CHAR_UNSAFE(buf, i, c);
            if (c < 0x80) {
                putchar((int)c);           /* Plain ASCII goes straight out. */
            } else {
                printf("&#%ld;", (long)c); /* As Lukas's code, use entities
                                              only above ASCII. */
            }
        }
        putchar('\n'); /* Separate lines; will produce white space in HTML. */
    }
    return 0;
}

Hope this helps,
YA




RE: UCD in XML

2001-05-15 Thread Yves Arrouye

 I then tried my usual remedy: Bow in precisely the correct 
 direction (359° 16' 32" N*)

Adjust the bearing for declination (15° 26' E according to my chart of the
bay), and try again compass in hand, maybe? ;-)

YA




RE: Using hex numbers considered a geek attitude

2001-05-03 Thread Yves Arrouye

BTW, does anybody know how to input characters on Windows using the hex
codepoint? I know it's good for my brain to do the exercise of going from
hexadecimal to decimal, but it is still a pain to have to type ALT-DECIMAL
when all I have in my book is hex. That would be a reason for providing the
decimal value (not in the tables, but in the properties pages that follow
maybe), actually. But I am sure there must be a way to input the hex
directly. Please?

YA




RE: Byte Order Marks

2001-04-20 Thread Yves Arrouye

  Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
  UTF16_BigEndian?
 
 ICU does not do Unicode-signature or other encoding detection 
 as part of a converter. When you get text from some protocol, 
 you need to instantiate a converter according to what you 
 know about the encoding.

So I can't pass it some text with a BOM and say "utf-16" and let it run
through that. I guess that explains why I also didn't find converters that
write a BOM at the start of the conversion. Is that something that would
added to ICU in the future? It would be very nice to have a converter that
would pick the BOM (and write it back).

And yes, most of the time, when you stay on a given platform, it is very
convenient to use the platform's endianness. My question was rather "why
isn't UTF-16 the one that detects the BOM and defaults to an externalized
form, BE, and then people on a given platform would just use UTF-16PE (of
which UTF-16 is an alias in ICU)?". That would facilitate interchange of
information.

YA




RE: Byte Order Marks

2001-04-20 Thread Yves Arrouye


 On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote:
  On the other hand, if you get a file from your platform and 
 it is in 16-bit Unicode, then you would appreciate the 
 convenience of the auto-endian alias.
 
 But nothing should be spitting out platform-endian UTF-16! In the
 case that there's a lot of unmarked big-endian UTF-16 around (as I
 understand the ISO-10646 standard recommends), then that assumption
 that everything emits unmarked platform-dependent UTF-16 will be
 wrong.

And for reference, on Windows, Unicode files are recognized because they
have a BOM. Write plain UTF-16LE w/o a BOM, and your file won't be
recognized properly. Manipulation of these files w/ ICU today is a bit
painful, since one needs to strip the BOM on input (if I understand Markus
correctly) and write a BOM at output. So these cannot be manipulated using
applications like uconv which blindly uses the raw converters.

YA




RE: Byte Order Marks

2001-04-19 Thread Yves Arrouye

 If you don't have any clue about the byte order, but you know it is
UTF-16, then assume BE.

Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
UTF16_BigEndian? I know that was a difference between ICU and my library,
and when I asked this question a while ago I was told that despite what some
literature suggests, w/o any clue, platform endianness should be used.
That's contradictory.

YA




RE: How will software source code represent 21 bit unicode charac ters?

2001-04-17 Thread Yves Arrouye

  Has this matter already been addressed anywhere?
 
 I think the C standard is in the process of making a decision 
 about this. If memory serves, we will have escapes like '\u' and '\U'.

I think they made the decision already. It is in the latest editions of the
standards. The only ambiguity (for me) is whether one can write:

uint32_t codepoint = '\U001';

and have it work, or if there's some implicit assumption that '\U001' is
of type wchar_t, in which case the construction is not portable because of
the fact that the size of wchar_t is implementation-specific, and can be as
small as 8 bits. I am sure we have a C/C++ expert (or many!) here that can
clear that up though.

YA




RE: Identifiers

2001-04-16 Thread Yves Arrouye

  On Sun, Apr 15, 2001 at 08:10:55PM +0200, Florian Weimer wrote:
   Is it sufficient to mandate that all such identifiers 
 MUST be KC- or
   KD-normalized?  Does this guarantee print-and-enter round-trip
   compatibility?
  
  In general, the problem is unsolvable. There are several look-alikes
  among the Cyrillic, Greek, Latin and Cherokee blocks, among others. 
 
 And those are not equivalent under normalization?  That's a pity.

But that is not the goal of the Unicode normalization! (Read UAX #15,
http://www.unicode.org/unicode/reports/tr15/). Which is to be expected, from
a standard about characters, and not glyphs.

The normalization you are talking about seems to me to be one that is
glyph-centric: you're looking at shapes and are wanting to avoid confusions
by making similar-looking things the same. We have normalization similar to
the one you're talking about in our Internet Keywords system. It is built on
top of NFKC. It is good for users, but then it is also very specific. For
example, we didn't consider the look-alikes among Cyrillic, Greek, and Latin
to be a problem for our users, but your comment about that being a pity
seems to imply that you would. I think such normalizations depend a lot
about who is going to need the names and in what context. It'll be very hard
to make a general recommendation that isn't too restrictive for many.

YA
  




RE: Identifiers

2001-04-16 Thread Yves Arrouye

 (I don't know if email addresses will be internationalized anytime
 soon.  This is just an example. ;-)

http://www.i-d-n.net/

They have a normalization process that may be used for e-mail someday. It
explicitly does not do anything about similar-looking glyphs. Read their
list archive, I'm sure the reason why has been discussed there. That may
give you ideas for what you're trying to achieve.

YA




RE: Identifiers

2001-04-16 Thread Yves Arrouye

 There should be a method to overcome the source separation rule which
 might have saved certain identical characters from unification.
 
-  U+0048 LATIN CAPITAL LETTER H
-  U+0397 GREEK CAPITAL LETTER ETA
-  U+041D CYRILLIC CAPITAL LETTER EN
-  U+13BB CHEROKEE LETTER MI
 
 If this were Han glyphs, they would have been unified, 
 wouldn't they? ;-)

Florian, I respectfully suggest that you look up the various technical
reports that accompany the Unicode standard. It looks like there may be
certain confusion about characters and glyphs with respect to the Unicode
standard (which tackles characters, not glyphs; Han *characters* were
unified, and they were in a single *script*). UTR #17
(http://www.unicode.org/unicode/reports/tr17/) should definitely be useful.
See section 2.1 for instance.

Hope this helps,
YA




RE: Identifiers

2001-04-16 Thread Yves Arrouye

 Florian, I respectfully suggest that you look up the various technical
 reports that accompany the Unicode standard. It looks like ther may be
 certain confusion about characters and glyphs

Oops, got tripped by my native French language. I didn't mean "certain" but
"some". Do not conclude that I jump to conclusions that easily :).

YA




RE: Identifiers

2001-04-16 Thread Yves Arrouye

  We have normalization similar to the one you're talking about in our
  Internet Keywords system. It is built on top of NFKC. It is good for
  users, but then it is also very specific.
 
 Details, details!  (Or do you consider that stuff a proprietary
 advantage?)

I don't really. That would be too fragile of an advantage to build on. But
as my signature shows, I may be mistaken :)

For a year-old explanation of the use of Unicode in our system, from the
16th IUC, see http://www.internetkeywords.org/iuc/realnames-iuc16-paper.htm.

Basically, we have two normalization forms. The first one is only for
presentation, and that is a very lightweight cleanup (remove invisible
characters, compress whitespace runs, map half-width characters to
full-width ones...). The second one is used to define uniqueness and that is
more restrictive; it builds on the cleaned up form. We do the following:

- Put the string in NFKC.
- Put the string in lowercase of its uppercase.
- Map some characters to take into account alternate spelling (German, for
example; when there is a conflict between languages, oops).
- Undo some ligatures that KC didn't undo (as in French "qui vole un oeuf
vole un boeuf").
- Map some characters that are visually very similar to their lowest common
denominator (ASCII) counterpart. For example, the prime and fancy
apostrophes (sorry, don't feel like fetching my Unicode book to get their
proper names) are considered the same as a vanilla apostrophe.

That's about it. We're considering doing new things regularly, and are/will
be also doing specific things to overcome limitations of our distribution
channels (for example, Kana mapping).
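
Re: the "lowercase of its uppercase" step, per code point it boils down to
this (a sketch; real case mapping is not one-to-one, so the production
version works on whole strings):

#include <wctype.h>

/* Lowercase of the uppercase, one code point at a time. Characters
   whose case mapping is not one-to-one (ß, etc.) need string-level
   treatment, hence the caveat above. */
static wint_t fold_case(wint_t c)
{
    return towlower(towupper(c));
}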

As I've said, it's specific to the user experience we want to present to
users of Keywords (fancy display, simpler input). There are obvious
limitations, and each time we start getting a fair number of names in a
given language, I look at these again, and try to do the "right thing"
(fortunately, this is a subjective and very adaptable notion ;-)). Any
pointers to problems that we may encounter, smart things to do, etc... are
of great interest to me, please send them!

YA
--
My opinions do not necessarily reflect my company's.
The opposite is also true.




RE: Identifiers

2001-04-15 Thread Yves Arrouye


 Is it sufficient to mandate that all such identifiers MUST be KC- or
 KD-normalized?  Does this guarantee print-and-enter round-trip
 compatibility?

It depends on the accuracy of both the printer and the reader. So I'd say no.
People won't necessarily make the difference between a middle dot and some
bullets, for example. But do we always want every identifier to resist the
"napkin test"? Not necessarily. And IDN is an example where this was not
chosen, so internationalized e-mail addresses, as per today's IDN I.D.s,
won't have this guarantee. And remember that for most people or
organizations, the problem will be much simpler: they won't understand the
identifier, let alone make such fine distinctions. For example, even if you
print a high-resolution version of a Japanese e-mail address, chances are that
I won't be able to type it in anyway in any software (though I may be able
to recognize the glyphs and copy/paste them from a Japanese site. Ugh)...

YA




RE: Sun's Java encodings vs IANA's character set registry

2001-04-12 Thread Yves Arrouye

 I should not be surprised by your statement, but I am. It is distressing
 to think that something that by definition should not be rocket science --
 repertoires of abstract characters mapped directly to specific bit patterns
 -- would be subject to such haphazard definition and even more haphazard
 implementation.

Backwards compatibility struck. As vendors changed the mappings, they kept
the same names so that they would not have to update software to use the new
names. Typically the changes are thought to enhance the encoding, and people
want everybody to benefit (isn't that ironic?). Shift_JIS is my favorite
incompatible charset. And just think of things like putting the Euro sign in
a bunch of encodings w/o changing their names, or of when Windows-1252 was
advertised as iso-8859-1 for interoperability purposes... It's a dangerous
world ;)

YA




RE: Digits in Unicode Names

2001-04-06 Thread Yves Arrouye


 What would really be nice, is for glibc-2.2 or any other unicode enabled
 library to display unicode characters, etc. by just using the "escape"
 sequence \uXXXX, where X represents a hexadecimal value.

Make that up to 6 Xs. One of the problems of such escapes when used in code,
a la ISO C++ (like the \ooo for octal digits) is that they're
variable-length, and stop as soon as an invalid char for the radix is
encountered. That makes them error-prone (but fun).

Does anybody know if the C++ standard specified how many hex digits max this
escape can have? And doesn't the standard say something like \u is for
wchar_t, which may not be Unicode (I hope I'm wrong here)?

YA





RE: locale files....

2001-03-30 Thread Yves Arrouye


 sorry. Intel platform running Redhat Linux 7.0..

Oops, and regarding your questions about locale files on Linux. They follow
the POSIX format and can easily be modified once you get them in source form
along with the localedef util.

YA




Re: UTF8 vs. Unicode (UTF16) in code

2001-03-09 Thread Yves Arrouye

  Since the U in UTF stands for Unicode, UTF-32 cannot represent more than
  what Unicode encodes, which is 1+ million code points. Otherwise, you're
  talking about UCS-4. But I thought that one of the latest revs of ISO
  10646 explicitly specified that UCS-4 will never encode more than what
  Unicode can encode, and thus definitely these 4 billion characters you're
  alluding to.
 
 As far as I know the U in UTF stands for Universal - not unicode.
 ISO 10646 can encode characters beyond UTF-16, and should retain
 this capability. There is a proposal to restrict UTF-8 to
 only encompas the same values as UTF-16, but UCS-4 still encodes
 the 31-bit code space.

Page 12 of the Unicode Standard 3.0 says:

"UTF-8 (Unicode Transformation Format-8) [...]"

which is what I used to build my knowledge of what the U stands for. But
I may be wrong.

Thanks for clarifying my confusion: the proposal is for restricting
UTF-8, not UCS-4.  So if the ISO never said that they will not encode
things beyond what Unicode can encode, and if UTF-8 is restricted, they
may someday need a UCSTF-8 (or whatever) to encode UCS-4, right? And the
only difference between UTF-8 and this UCSTF-8 may be the semantics of
what can be encoded and what is legal after decoding.

YA





RE: New Name Registry Using Unicode

2000-09-29 Thread Yves Arrouye

 The people doing this are www.xns.org and www.onename.com. One needs to
 visit their sites and read their "white papers" to get a full picture of
 what the purpose is and how they are using the standards.

Note that there are other naming initiatives, including the one driven by my
company, RealNames, which was presented at the 16th Unicode Conference. See
http://www.internetkeywords.org/iuc/realnames-iuc16-paper.htm which contains
both a paper and my presentation slides, linked from the Unicode Web site
(http://www.unicode.org/iuc/iuc16/papers.html).

Lastly, people interested in naming may want to check out CNRP, an IETF
protocol for the resolution of common names, at
http://www.ietf.org/html.charters/cnrp-charter.html.

YA



RE: Unicode in VFAT file system

2000-07-20 Thread Yves Arrouye

 Recently I've had the dubious pleasure of delving into the details of 
 the VFAT file system. For long file names, I thought it used UCS-2, 
 but in looking at the data with a disk editor, it appears to be 
 byte-swapping (little endian). I thought that UCS-2 was by definition 
 big endian, thus I've got the following questions:
 
 1. Could it be using UTF-16LE? I tried creating an entry with a 
 surrogate pair, but the name was displayed with two black boxes on a 
 Windows 2000-based computer, so I assumed that surrogates were not 
 supported.

It is UTF-16 (LE, because of Intel architecture), and AFAIK there is no
surrogate support yet. Not that there would be anything to display, except
one box instead of two :)
 
YA