Hi Addison,

>UCS-2 is pretty close to the same thing as UTF-16. The differences do not
>apply here.
>
>UCS-2 can be big-endian or little-endian. The rule is that BE is the
>default. However, on Intel platforms, you shouldn't be surprised to see LE
>everywhere: that's the architecture. Microsoft is saving two bytes for
>every filename by not storing a BOM.

Thanks for the fast response. I was basing my understanding of UCS-2 
always being big-endian on Markus Kuhn's prior email, which said:

At 2:58am -0800 00-02-18, Markus Kuhn wrote:
>Date: Fri, 18 Feb 2000 02:58:51 -0800 (PST)
>From: Markus Kuhn <[EMAIL PROTECTED]>
>Subject: Re: UCS-4, UCS-2, UTF-16, UTF-8
>To: Unicode List <[EMAIL PROTECTED]>
>X-UML-Sequence: 12380 (2000-02-18 10:58:53 GMT)
>
>Yung-Fong Tang wrote on 2000-02-17 21:18 UTC:
>  > UCS-4 does not specify byte order, but UTF-32BE and
>  > UTF-32LE does.
>
>No. UCS-2 and UCS-4 have always been bigendian. Read ISO 10646-1:1993,
>section "6.3 Octet order" (page 7):
>
>   When serialized as octets, a more significant octet shall
>   precede less significant octets.
>
>ISO and ITU have fortunately always frowned upon Intel's horrible 1970s
>decision of staying compatible with some obscure long-forgotten 1960s
>mainframe for which they had bought some software when they made the
>8080 a littleendian processor (Intel's microcontrollers by the way are
>all bigendian, as is pretty much anything else that was not designed to
>be Intel compatible).

So now I'm a bit confused, since I've never heard of UCS-2LE/UCS-2BE.
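
To make the byte-order question concrete, here is a minimal Python sketch of the same 16-bit code units serialized both ways. The sample string and the helper name swap16 are my own, chosen for illustration. I use Python's "utf-16-be" and "utf-16-le" codecs as a stand-in for big- and little-endian UCS-2, since the two produce the same bytes for characters in the BMP.

```python
# Illustrative only: the sample name and swap16() are hypothetical.
name = "Ab\u00e9"  # 'A', 'b', and U+00E9, all in the BMP

be = name.encode("utf-16-be")  # more significant octet first
le = name.encode("utf-16-le")  # less significant octet first

print(be.hex(" "))  # 00 41 00 62 00 e9
print(le.hex(" "))  # 41 00 62 00 e9 00


def swap16(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit unit (length must be even)."""
    out = bytearray(len(data))
    out[0::2] = data[1::2]
    out[1::2] = data[0::2]
    return bytes(out)


# Byte-swapping one serialization yields the other.
print(swap16(be) == le)  # True
```

This matches what I see in the disk editor: the low byte of each character comes first.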

>You should note that Microsoft *means* UCS-2LE (and UTF-16LE in more
>modern systems) when they say "Unicode" (at least on Intel platforms).
>
>So:
>
>1. Yes, it is perfectly valid.
>2. There are no characters in the surrogate space just yet, so a black
>square should be no surprise. Two black squares means that it's being
>treated as UCS-2.

Does anybody know if Microsoft has publicly stated if/when they'll 
support surrogates in VFAT file names?
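
For what it's worth, the "two black squares" result is easy to reproduce on paper. The sketch below (Python, with hypothetical helper names of my own) splits little-endian data into 16-bit code units. A supplementary code point such as U+10000 becomes two units, and a reader that treats the data as UCS-2 would see each one as a separate, unassigned character.

```python
# Illustrative only: code_units_le() and is_surrogate() are hypothetical helpers.
def code_units_le(data: bytes) -> list:
    """Split little-endian data into 16-bit code units."""
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def is_surrogate(unit: int) -> bool:
    """True for any code unit in the surrogate range D800..DFFF."""
    return 0xD800 <= unit <= 0xDFFF


units = code_units_le("\U00010000".encode("utf-16-le"))
print([hex(u) for u in units])          # ['0xd800', '0xdc00']
print([is_surrogate(u) for u in units])  # [True, True]
```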

>3. Filenames are, by definition in Windows-land, UPPERCASE in Western
>European systems.

My understanding is that with DOS they were always upper-cased, but 
probably only for the Western European code pages. With VFAT, the 
file names are stored as-is, but uniqueness is checked 
case-insensitively (and only for characters in the Basic Latin and 
Latin-1 Supplement ranges).

>Other scripts either don't have the concept of case or
>weren't mucked with. This includes compatibility characters stored outside
>the U+0000 to U+00FF range.

OK - this matches the behavior I was seeing with Japanese Windows 
systems, where full-width Romaji isn't case-folded before checking 
file names.
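
Here is a rough Python sketch of the comparison I think I'm seeing. It is only a model of the observed behavior, not Microsoft's actual algorithm, and the function names fold_latin1 and same_entry are made up for this example. The assumption is that only characters in U+0000 to U+00FF are upper-cased, and only when the result stays in that range.

```python
# Illustrative model only: fold_latin1() and same_entry() are hypothetical.
def fold_latin1(name: str) -> str:
    """Upper-case characters in U+0000..U+00FF; leave everything else alone."""
    out = []
    for ch in name:
        if ord(ch) <= 0xFF:
            up = ch.upper()
            # Skip mappings that expand or leave the range (e.g. U+00DF, U+00FF).
            if len(up) == 1 and ord(up) <= 0xFF:
                ch = up
        out.append(ch)
    return "".join(out)


def same_entry(a: str, b: str) -> bool:
    """Would the two names collide under the assumed Latin-1-only folding?"""
    return fold_latin1(a) == fold_latin1(b)


print(same_entry("Readme.TXT", "README.txt"))  # True
print(same_entry("caf\u00e9", "CAF\u00c9"))    # True
print(same_entry("\uff41", "\uff21"))          # False (full-width a vs. A)
```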

Thanks,

-- Ken

>===========================================================
>Addison P. Phillips                    Principal Consultant
>Inter-Locale LLC                http://www.inter-locale.com
>Los Gatos, CA, USA          mailto:[EMAIL PROTECTED]
>
>+1 408.210.3569 (mobile)              +1 408.904.4762 (fax)
>===========================================================
>Globalization Engineering & Consulting Services
>
>On Thu, 20 Jul 2000, Ken Krugler wrote:
>
>  > Hi Unicoders,
>  >
>  > Recently I've had the dubious pleasure of delving into the details of
>  > the VFAT file system. For long file names, I thought it used UCS-2,
>  > but in looking at the data with a disk editor, it appears to be
>  > byte-swapping (little endian). I thought that UCS-2 was by definition
>  > big endian, thus I've got the following questions:
>  >
>  > 1. Could it be using UTF-16LE? I tried creating an entry with a
>  > surrogate pair, but the name was displayed with two black boxes on a
>  > Windows 2000-based computer, so I assumed that surrogates were not
>  > supported.
>  >
>  > 2. Is little-endian UCS-2 a valid encoding that I just don't know about?
>  >
>  > 3. And finally, why are file names case-insensitive for characters in
>  > the U-0000 to U-00FF range, but not for any other characters? OK,
>  > maybe I can guess at the answer to that one...
>  >
>  > Thanks,
>  >
>  > -- Ken
>  > Ken Krugler
>  > TransPac Software, Inc.
>  > <http://www.transpac.com>
>  > +1 530-470-9200
>  >

Ken Krugler
TransPac Software, Inc.
<http://www.transpac.com>
+1 530-470-9200
