Re: Unicode in VFAT file system

2000-07-26 Thread addison

Actually, the problem is the *same old thing*: no education about I18N
issues in general. There are all sorts of interesting "biases" about
Unicode related to the still lamentable level of I18N training that the
average developer receives.

It's simply shocking.

Best regards,

Addison

On Wed, 26 Jul 2000, Michael (michka) Kaplan wrote:

> Ah, I did know that! :-)
> 
> The place I find a UTF-8 bias most often is in people doing web content and
> people working with Oracle. Good to know there are others with UCS-2/UTF-16
> biases!
> 
> And of course it's even better when we get them to accept that they are
> mainly different ways of expressing the same thing.
> 
> michka
> 




errata! Re: Unicode in VFAT file system

2000-07-26 Thread Michael (michka) Kaplan

> Ah, I did know that! :-)

Oops, I meant "Ah, I did *not* know that!"

Mea culpa.

michka




Re: Unicode in VFAT file system

2000-07-26 Thread addison

That's not true. Even serious UNIX shops start with the perception that
there is only one Unicode and that it is 16-bits (UCS-2). I get this *all*
the time.

Addison

===
Addison P. Phillips    Principal Consultant
Inter-Locale LLC       http://www.inter-locale.com
Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering & Consulting Services

On Thu, 20 Jul 2000, Michael (michka) Kaplan wrote:

> Although there is some truth here the fact is that it is not really true
> today that everyone equates the two. The default thought on people's minds
> these days when they think of Unicode is UTF-8, it seems like. And this is
> mainly due to applications of Unicode to the web, I think.
> 
> In the meantime, Microsoft is still pretty firmly rooted in the idea that
> Unicode=UCS-2 (or UTF-16le on Windows 2000). UTF-8 is named UTF-8 and
> considered to be a multibyte encoding.
> 
> michka
> 
> 
> - Original Message -
> From: "Doug Ewell" <[EMAIL PROTECTED]>
> To: "Unicode List" <[EMAIL PROTECTED]>
> Sent: Thursday, July 20, 2000 10:41 PM
> Subject: Re: Unicode in VFAT file system
> 
> 
> > Addison Phillips <[EMAIL PROTECTED]> wrote:
> >
> > > Avoiding for the moment the word-parsing that Markus suggests, Unicode
> > > on Microsoft platforms has always been LE (at least on Intel) and they
> > > have called the encoding they use "UCS-2" (when they bothered with
> > > such things: in the past they always called it "Unicode" as if it were
> > > the *only* encoding). As Unicode has evolved, Microsoft products have
> > > become more exact in this regard.
> >
> > I remember that in the early to mid '90s, before the invention (or at
> > least widespread use) of UTF-8, UTF-32, and surrogates, *everybody* --
> > not just Microsoft -- used the term "Unicode" to refer to what we would
> > now call UCS-2.  Even the Unicode Consortium did this!  And even now,
> > the few of my co-workers who know about Unicode (I'm trying to spread
> > the word, folks, honest) think a "Unicode text file" is UCS-2 by
> > definition.  I don't know what they would think of a UTF-8 file --
> > nobody but me is knowingly using them yet.  In any case, this usage is
> > by no means confined to Microsoft.
> >
> > -Doug Ewell
> >  Fullerton, California
> >
> 
> 




Re: Unicode in VFAT file system

2000-07-26 Thread Michael (michka) Kaplan

Ah, I did know that! :-)

The place I find a UTF-8 bias most often is in people doing web content and
people working with Oracle. Good to know there are others with UCS-2/UTF-16
biases!

And of course it's even better when we get them to accept that they are
mainly different ways of expressing the same thing.

michka


- Original Message -
From: <[EMAIL PROTECTED]>
To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
Cc: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, July 27, 2000 1:56 AM
Subject: Re: Unicode in VFAT file system


> That's not true. Even serious UNIX shops start with the perception that
> there is only one Unicode and that it is 16-bits (UCS-2). I get this *all*
> the time.
>
> Addison
>
> ===
> Addison P. Phillips    Principal Consultant
> Inter-Locale LLC       http://www.inter-locale.com
> Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]
>
> +1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
> ===
> Globalization Engineering & Consulting Services
>
> On Thu, 20 Jul 2000, Michael (michka) Kaplan wrote:
>
> > Although there is some truth here the fact is that it is not really true
> > today that everyone equates the two. The default thought on people's minds
> > these days when they think of Unicode is UTF-8, it seems like. And this is
> > mainly due to applications of Unicode to the web, I think.
> >
> > In the meantime, Microsoft is still pretty firmly rooted in the idea that
> > Unicode=UCS-2 (or UTF-16le on Windows 2000). UTF-8 is named UTF-8 and
> > considered to be a multibyte encoding.
> >
> > michka
> >
> >
> > ----- Original Message -
> > From: "Doug Ewell" <[EMAIL PROTECTED]>
> > To: "Unicode List" <[EMAIL PROTECTED]>
> > Sent: Thursday, July 20, 2000 10:41 PM
> > Subject: Re: Unicode in VFAT file system
> >
> >
> > > Addison Phillips <[EMAIL PROTECTED]> wrote:
> > >
> > > > Avoiding for the moment the word-parsing that Markus suggests, Unicode
> > > > on Microsoft platforms has always been LE (at least on Intel) and they
> > > > have called the encoding they use "UCS-2" (when they bothered with
> > > > such things: in the past they always called it "Unicode" as if it were
> > > > the *only* encoding). As Unicode has evolved, Microsoft products have
> > > > become more exact in this regard.
> > >
> > > I remember that in the early to mid '90s, before the invention (or at
> > > least widespread use) of UTF-8, UTF-32, and surrogates, *everybody* --
> > > not just Microsoft -- used the term "Unicode" to refer to what we would
> > > now call UCS-2.  Even the Unicode Consortium did this!  And even now,
> > > the few of my co-workers who know about Unicode (I'm trying to spread
> > > the word, folks, honest) think a "Unicode text file" is UCS-2 by
> > > definition.  I don't know what they would think of a UTF-8 file --
> > > nobody but me is knowingly using them yet.  In any case, this usage is
> > > by no means confined to Microsoft.
> > >
> > > -Doug Ewell
> > >  Fullerton, California
> > >
> >
> >
>
>




Re: Unicode in VFAT file system

2000-07-26 Thread addison

Well... there was only one Unicode in those days. But the vagueness
persisted after its time. This is fine in the consumer documentation,
where it really doesn't matter. But in the development docs it is a real
problem.

Of course, I understand that software development cycles, the size of the
Windows API, and other factors are also involved. And no, Microsoft isn't
alone in this. But it is important to point out to Windows developers
where the documentation is deficient. If we were talking about some other
system, I'd probably have similar comments to make (or worse comments: at
least MS went to the considerable difficulty of building Unicode support
into NT from the get-go).

Regards,

Addison

On Fri, 21 Jul 2000, Doug Ewell wrote:

> Addison Phillips <[EMAIL PROTECTED]> wrote:
> 
> > Avoiding for the moment the word-parsing that Markus suggests, Unicode
> > on Microsoft platforms has always been LE (at least on Intel) and they
> > have called the encoding they use "UCS-2" (when they bothered with
> > such things: in the past they always called it "Unicode" as if it were
> > the *only* encoding). As Unicode has evolved, Microsoft products have
> > become more exact in this regard.
> 
> I remember that in the early to mid '90s, before the invention (or at
> least widespread use) of UTF-8, UTF-32, and surrogates, *everybody* --
> not just Microsoft -- used the term "Unicode" to refer to what we would
> now call UCS-2.  Even the Unicode Consortium did this!  And even now,
> the few of my co-workers who know about Unicode (I'm trying to spread
> the word, folks, honest) think a "Unicode text file" is UCS-2 by
> definition.  I don't know what they would think of a UTF-8 file --
> nobody but me is knowingly using them yet.  In any case, this usage is
> by no means confined to Microsoft.
> 
> -Doug Ewell
>  Fullerton, California
> 




RE: Unicode in VFAT file system

2000-07-21 Thread Jonathan Rosenne

I apologize.

Jony

> -Original Message-
> From: Becker [mailto:Becker]
> Sent: Friday, July 21, 2000 10:34 PM
> To: Unicode List
> Cc: Myself
> Subject: RE: Unicode in VFAT file system
> 
> 
> 
> Jony Rosenne, who has been a great contributor since or before the
> beginning, wrote in an off moment:
> 
> > UTF-8 is a biased transformation format designed to save American and
> > Western Europeans storage space and to give some people a warm feeling by
> > keeping Unicode in the familiar 8 bit world.
> 
> FYI, below are the design goals of UTF-8 as specified by its originators,
> Ken Thompson et al @ ATT.
> 
> Joe
> 
> 




Re: Unicode in VFAT file system

2000-07-21 Thread Peter_Constable


On 07/21/2000 12:55:59 PM <[EMAIL PROTECTED]> wrote:

>The problem is that the labels were invented to tag data streams, not to
>'label' the result of autodetection. As you point out there are 4 results of
>auto-detection:
>
>UTF-16, no BOM
>UTF-16, no BOM, but arriving in reverse byte order (for my processor)
>UTF-16 with BOM
>UTF-16 with BOM, arriving in reverse byte order (for my processor)
>
>When I send a data stream, I have these conditions
>
>1) don't know byte order
>a) send it out bare
>b) send it out with BOM
>
>2) do know byte order
>a) send it out with BOM, but don't tell recipient the byte order
>b) don't use bom, and TELL RECIPIENT the byte order in an external LABEL
>
>LABELS UTF-16BE and UTF-16LE are to be used for case 2b *only*.
>LABEL UTF-16 is required for 1a and b and 2a.
>
>The hypothetical case of TELLING the recipient the byte order *and* using the
>BOM at the same time is not supported.

(EMPHASIS added - PC)

Now, this is precisely my point! These terms are what we *tell* a recipient
about our data, not something directly about the data themselves. The
explanations all sound like they're about the data themselves, however, and
that makes things confusing.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





RE: Unicode in VFAT file system

2000-07-21 Thread Becker, Joseph


Jony Rosenne, who has been a great contributor since or before the
beginning, wrote in an off moment:

> UTF-8 is a biased transformation format designed to save American and
> Western Europeans storage space and to give some people a warm feeling by
> keeping Unicode in the familiar 8 bit world.

FYI, below are the design goals of UTF-8 as specified by its originators,
Ken Thompson et al @ ATT.

Joe


---
From: [EMAIL PROTECTED]
Date: Tue, 8 Sep 92 03:22:07 EDT
To: [EMAIL PROTECTED]
Subject: (XoJIG 620) 

Here is our modified FSS-UTF proposal.  The words are the same as on the
previous proposal.  My apologies to the author.  The code has been tested to
some degree and should be pretty good shape.  We have converted Plan 9 to
use this encoding and are about to issue a distribution to an initial set of
university users.

File System Safe Universal Character Set Transformation Format (FSS-UTF)
--

With the approval of ISO/IEC 10646 (Unicode) as an international standard
and the anticipated wide spread use of this universal coded character set
(UCS), it is necessary for historically ASCII based operating systems to
devise ways to cope with representation and handling of the large number of
characters that are possible to be encoded by this new standard.

There are several challenges presented by UCS which must be dealt with by
historical operating systems and the C-language programming environment.
The most significant of these challenges is the encoding scheme used by UCS.
More precisely, the challenge is the marrying of the UCS standard with
existing programming languages and existing operating systems and utilities.

The challenges of the programming languages and the UCS standard are being
dealt with by other activities in the industry.  However, we are still faced
with the handling of UCS by historical operating systems and utilities.
Prominent among the operating system UCS handling concerns is the
representation of the data within the file system.  An underlying assumption
is that there is an absolute requirement to maintain the existing operating
system software investment while at the same time taking advantage of the
use the large number of characters provided by the UCS.

UCS provides the capability to encode multi-lingual text within a single
coded character set.  However, UCS and its UTF variant do not protect null
bytes and/or the ASCII slash ("/") making these character encodings
incompatible with existing Unix implementations.  The following proposal
provides a Unix compatible transformation format of UCS such that Unix
systems can support multi-lingual text in a single encoding.  This
transformation format encoding is intended to be used as a file code.  This
transformation format encoding of UCS is intended as an intermediate step
towards full UCS support.  However, since nearly all Unix implementations
face the same obstacles in supporting UCS, this proposal is intended to
provide a common and compatible encoding during this transition stage.


Goal/Objective
--

With the assumption that most, if not all, of the issues surrounding the
handling and storing of UCS in historical operating system file systems are
understood, the objective is to define a UCS transformation format which
also meets the requirement of being usable on a historical operating system
file system in a non-disruptive manner.  The intent is that UCS will be the
process code for the transformation format, which is usable as a file code.

Criteria for the Transformation Format
--

Below are the guidelines that were used in defining the UCS transformation
format:

1) Compatibility with historical file systems:

Historical file systems disallow the null byte and the ASCII slash
character as a part of the file name.

2) Compatibility with existing programs:

The existing model for multibyte processing is that ASCII does not
occur anywhere in a multibyte encoding.  There should be no ASCII code
values for any part of a transformation format representation of a character
that was not in the ASCII character set in the UCS representation of the
character.

3) Ease of conversion from/to UCS.

4) The first byte should indicate the number of bytes to follow in a
multibyte sequence.

5) The transformation format should not be extravagant in terms of
number of bytes used for encoding.

6) It should be possible to find the start of a character
efficiently starting from an arbitrary location in a byte stream.


Proposed FSS-UTF


The proposed UCS transformation format encodes UCS values in the range
[0,0x7fffffff] using multibyte characters of lengths 1, 2, 3, 4, 5, and 6
bytes.  For all encodings of more than one byte, the initial byte determines
the number of bytes used and the high-order bit in each byte is set. 
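
[Illustration, not part of the 1992 message.] The encoding rule described above can be sketched in Python. The bit layout used here (lead bytes 110xxxxx through 1111110x, continuation bytes 10xxxxxx) is an assumption taken from the layout later standardized as UTF-8, extended to six bytes so that 31-bit values fit. The names fss_utf_encode and is_sequence_start are made up for this sketch.

```python
def fss_utf_encode(value: int) -> bytes:
    """Encode one UCS value in [0, 0x7FFFFFFF] as 1 to 6 bytes (sketch)."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value out of range")
    if value < 0x80:
        return bytes([value])  # ASCII passes through as a single byte
    # (largest value, lead-byte marker) for sequences of 2..6 bytes
    layouts = [(0x7FF, 0xC0), (0xFFFF, 0xE0), (0x1FFFFF, 0xF0),
               (0x3FFFFFF, 0xF8), (0x7FFFFFFF, 0xFC)]
    for extra, (limit, marker) in enumerate(layouts, start=1):
        if value <= limit:
            # 'extra' continuation bytes, 6 payload bits each, high bit set
            tail = [0x80 | ((value >> (6 * i)) & 0x3F)
                    for i in range(extra - 1, -1, -1)]
            lead = marker | (value >> (6 * extra))
            return bytes([lead] + tail)
    raise AssertionError("unreachable")


def is_sequence_start(byte: int) -> bool:
    """Criterion 6: continuation bytes look like 10xxxxxx, so any other
    byte marks the start of a character."""
    return (byte & 0xC0) != 0x80
```

Because every byte of a multibyte sequence has its high bit set, neither the null byte nor the ASCII slash can appear inside one, which is what criteria 1 and 2 ask for.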

Re: Unicode in VFAT file system

2000-07-21 Thread Asmus Freytag

At 07:14 AM 7/21/00 -0800, [EMAIL PROTECTED] wrote:
>Why does it say there are three varieties when a 16-bit datum can only be
>serialised in two orders? If the scheme UTF-16 doesn't have a BOM, isn't it
>just one of the other two? When it does have a BOM, it can still be
>serialised in two ways, so aren't there four schemes - 2 serialisations x
>±BOM? I barely manage to make sense of forms and schemes and then they
>confuse me with this stuff!

The problem is that the labels were invented to tag data streams, not to 
'label' the result of autodetection. As you point out there are 4 results 
of auto-detection:

UTF-16, no BOM
UTF-16, no BOM, but arriving in reverse byte order (for my processor)
UTF-16 with BOM
UTF-16 with BOM, arriving in reverse byte order (for my processor)

When I send a data stream, I have these conditions

1) don't know byte order
a) send it out bare
b) send it out with BOM

2) do know byte order
a) send it out with BOM, but don't tell recipient the byte order
b) don't use bom, and tell recipient the byte order in an external label

labels UTF-16BE and UTF-16LE are to be used for case 2b *only*.
label UTF-16 is required for 1a and b and 2a.

The hypothetical case of telling the recipient the byte order *and* using 
the BOM at the same time is not supported.
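
[Illustration added to the archive.] The labelling rule above can be written down as a small Python sketch. The function name choose_label and its two parameters are hypothetical; only the mapping from cases 1a, 1b, 2a and 2b to labels comes from the message.

```python
from typing import Optional


def choose_label(declared_order: Optional[str], with_bom: bool) -> str:
    """Pick the external label for a UTF-16 data stream (sketch).

    declared_order is "BE" or "LE" when the sender knows the byte order
    and wants to state it, or None otherwise.
    """
    if declared_order is None or with_bom:
        # Cases 1a, 1b and 2a: the label says nothing about byte order.
        return "UTF-16"
    # Case 2b: no BOM in the stream, byte order stated in the label.
    return "UTF-16" + declared_order
```

With this shape the unsupported combination (a BOM in the stream plus a label that states the byte order) cannot be produced, since a stream carrying a BOM is always labelled plain UTF-16.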

A./



Re: Unicode in VFAT file system

2000-07-21 Thread Peter_Constable



On 07/21/2000 01:24:02 PM <[EMAIL PROTECTED]> wrote:

>> Why does it say there are three varieties when a 16-bit datum can only be
>> serialised in two orders?
>
>The simplest way to think about it is to remember that a MIME charset is meant
>to provide *minimal* information for the receiver to convert bytes into
>characters.  If the receiver gets FF FE 01 02, then it *must* be interpreted as
>follows depending on the charset...

I understand that these determine different interpretations of a stream in
those circumstances. But, the explanation "the encoding form UTF-16 has
three encoding schemes..." doesn't appear in the context of a discussion of
MIME charsets or anything like that; it's just a context-free statement. If
you read D33 and D34, then read D35


UTF-16 is the Unicode Transformation Format that serializes a Unicode value
as a sequence of two bytes, in either big-endian or little-endian format...


someone can easily be left thinking, "so it's one of the previous two, but
they seem to be saying it's a third, which doesn't make sense". That's
because the labels UTF-16, UTF-16BE and UTF-16LE aren't about what's
actually in the text stream but rather are about what is explicitly *said*
about what's in the text stream. Yet the definitions never make that clear.
That's my point.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





RE: Unicode in VFAT file system

2000-07-21 Thread Asmus Freytag

At 04:58 AM 7/21/00 -0800, [EMAIL PROTECTED] wrote:
>If UCS-2LE is a *standard* encoding (and it is in fact mentioned in UTR-17),
>how does VFAT directories qualify as a "higher level protocol"?
>
>My understanding of "higher level protocol" is that it is a *non* standard
>usage of some kind, allowed internally (or within a private group), that
>should not be transmitted to the world at large.
>
>Does this mean that MS's VFAT directories miss something (e.g. a BOM) to be
>a true UCS-2LE?
>
>Or are you simply meaning that the internals of an operating system are a
>"higher level protocol" by definition (even if they casually comply with
>some standard)?

Since you have to abide by other specifications in order to even get to the 
data field with the directory name, all of VFAT is a higher level protocol.

In that sense, any OS and its API are higher level protocols.
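
[Illustration added to the archive.] A minimal Python sketch of what "the protocol supplies the byte order" means in practice. The bytes below are a made-up name field, not the layout of a real VFAT directory entry; the point is only that the data carries no BOM and the reader must already know the order.

```python
# Hypothetical raw bytes of a name field, stored little-endian with no BOM.
raw_field = "résumé.txt".encode("utf-16-le")

# The enclosing format, not the data itself, tells the reader that the
# 16-bit code units are little-endian.
name = raw_field.decode("utf-16-le")

# A reader assuming the opposite order gets different text (and for some
# inputs would get a decoding error instead).
wrong = raw_field.decode("utf-16-be")
```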

Now, HLP's are definitely free to override certain things (including byte 
order), definitely not free to override others (no use of unassigned code 
points) and there's (in my view) a bit of a gray area in between where our 
understanding of where Unicode needs to allow flexibility for 
implementations sake and where it must restrict implementations in order to 
make meaningful guarantees about what constitutes Unicode is likely to 
evolve further.

A./



Re: Unicode in VFAT file system

2000-07-21 Thread John Cowan

[EMAIL PROTECTED] wrote:

> Why does it say there are three varieties when a 16-bit datum can only be
> serialised in two orders? 

The simplest way to think about it is to remember that a MIME charset is meant
to provide *minimal* information for the receiver to convert bytes into
characters.  If the receiver gets FF FE 01 02, then it *must* be interpreted
as follows depending on the charset:

UTF-16:   U+0201
UTF-16BE: U+FFFE U+0102
UTF-16LE: U+FEFF U+0201

For any given byte sequence, at most two of the charsets produce a meaningful
sequence of characters, since U+FFFE is not a character, but that doesn't
affect charset decoding.
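
[Illustration added to the archive.] The three readings above can be checked with Python's built-in codecs, assuming its "utf-16", "utf-16-be" and "utf-16-le" codecs behave like the three charsets (the first consumes a leading BOM, the other two do not). The helper name code_points is made up for this sketch.

```python
def code_points(data: bytes, charset: str) -> list:
    """Decode data with the given charset and list the results as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in data.decode(charset)]


data = bytes.fromhex("FFFE0102")
for charset in ("utf-16", "utf-16-be", "utf-16-le"):
    print(charset, " ".join(code_points(data, charset)))
```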

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <[EMAIL PROTECTED]>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,   || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)



Re: Unicode in VFAT file system

2000-07-21 Thread Michael (michka) Kaplan

I sort of misspoke; I do not mean that they only support UCS-2. I just
mean that usually in their products, interfaces, APIs, etc., they talk about
Unicode and they happen to be referring to UCS-2 (except on Windows 2000,
where surrogate support is definitely there and thus it's more UTF-16).

That is why when they added UTF-8 to the notepad "save as" menu, they did
not say Unicode (UTF-8). Maybe they believed it would be confusing?

michka


- Original Message -
From: <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Friday, July 21, 2000 4:50 AM
Subject: Re: Unicode in VFAT file system


>
> >In the meantime, Microsoft is still pretty firmly rooted in the idea that
> >Unicode=UCS-2 (or UTF-16le on Windows 2000).
>
> I don't think we can make a blanket statement about MS being firmly rooted
> in UCS-2. They're very big and manage lots of code that a lot of people use
> on a regular basis, and different parts of their code are in different
> states. This thread started with VFAT, and it has been observed that,
> perhaps, VFAT handles UTF-16 perfectly well while the UI that displays VFAT
> strings still only understands UCS-2. Considering some other parts of their
> code: Apple updated the TrueType spec to handle non-BMPs and I understand
> that MS has already implemented support for that. So, somewhere their code
> is handling UTF-16. Chris Pratley has indicated on this list or in his last
> Unicode presentation that the next version of Office will handle non-BMPs,
> so it will soon be doing UTF-16. But then, many of the controls that come
> with their development tools still only support "ANSI" text (e.g. in VB
> you store Unicode strings, but don't expect to display anything not
> supported by a codepage in a list box).
>
>
> - Peter
>
>
> ---
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <[EMAIL PROTECTED]>
>
>
>




Re: Unicode in VFAT file system

2000-07-21 Thread John Cowan

Mark Davis wrote:

> The best way I find to think of UCS-2 at this point is *not*
> (𝑛𝑜𝑡)

The variables named 𝑛, 𝑜, and 𝑡 are not defined
anywhere in your posting, so I cannot tell what their product might be.

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <[EMAIL PROTECTED]>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,   || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)



Re: Unicode in VFAT file system

2000-07-21 Thread Peter_Constable


>As a serialization, UTF-16 has three forms: UTF-16, UTF-16BE, and UTF-16LE. The
>first is with (optionally) a BOM, and the others without.

I know this is what the Standard dictates, and I think I understand why,
but it doesn't make complete sense to the novice trying to find his/her
way:


Why does it say there are three varieties when a 16-bit datum can only be
serialised in two orders? If the scheme UTF-16 doesn't have a BOM, isn't it
just one of the other two? When it does have a BOM, it can still be
serialised in two ways, so aren't there four schemes - 2 serialisations x
±BOM? I barely manage to make sense of forms and schemes and then they
confuse me with this stuff!


Don't we really mean that there are three approved ways in which the
encoding scheme of a stream can be labelled? Wouldn't it be clearer to say
that UTF-16 has two serialisations (not forms! since we're talking about
schemes), and that the encoding scheme of a stream can be labelled in one
of three ways: UTF-16, UTF-16BE and UTF-16LE?



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: Unicode in VFAT file system

2000-07-21 Thread Mark Davis

Unicode has changed and evolved over the years. At this point, UCS-2 is a funny
beast, because it shares precisely the same encoding space as UTF-16. That is,
in code units there is absolutely no difference between them. The only real
difference is whether you interpret the code units in the range D800..DFFF.
(Interpret them correctly, of course!)

As a serialization, UTF-16 has three forms: UTF-16, UTF-16BE, and UTF-16LE. The
first is with (optionally) a BOM, and the others without. Since UCS-2 shares the
same coding space, and thus serialization, it's not really a good idea to speak
of UCS-2LE etc.; much better to just use the UTF-16 names.

The best way I find to think of UCS-2 at this point is *not*
(𝑛𝑜𝑡)  another encoding, but rather simply a shorthand
for a particular supported subset of UTF-16. In that way, it is like other
subsets: for example, I can talk about the Cyrillic-block repertoire in UTF-16.
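
[Illustration added to the archive.] The "same code units, different interpretation" point can be sketched in Python. The helper names utf16_code_units and fits_ucs2 are made up; the test is simply whether any 16-bit code unit falls in D800..DFFF.

```python
from typing import List


def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code units of text in its UTF-16 encoding form."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


def fits_ucs2(text: str) -> bool:
    """True when no code unit lies in the surrogate range D800..DFFF,
    i.e. the text stays inside the subset that UCS-2 names."""
    return not any(0xD800 <= unit <= 0xDFFF
                   for unit in utf16_code_units(text))
```

For Cyrillic text the code units are identical under either name; a character such as U+1D45B (the mathematical italic 𝑛 used above) needs the surrogate pair D835 DC5B, which only a UTF-16 reader interprets as one character.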

Mark




Re: Unicode in VFAT file system

2000-07-21 Thread Peter_Constable


On 07/21/2000 04:58:15 AM <[EMAIL PROTECTED]> wrote:

>If UCS-2LE is a *standard* encoding (and it is in fact mentioned in UTR-17),
>how does VFAT directories qualify as a "higher level protocol"?

It is, essentially, a closed system that can apply proprietary conventions
to how it chooses to represent information. If it wanted, it could take
UTF-16 codes and rearrange the nibbles in the order 3 1 0 2, and it could
still be conformant. All that matters for conformance is that it do the
right thing when it transmits or receives textual data: among other things,
it needs to use an approved encoding form/scheme.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





RE: Unicode in VFAT file system

2000-07-21 Thread Marco . Cimarosti

Asmus Freytag wrote:
> At 09:53 AM 7/20/00 -0800, Ken Krugler wrote:
> >2. Is little-endian UCS-2 a valid encoding that I just don't know about?
> 
> Yes, it is. Your example of the VFAT system is a near perfect case, since
> the details of it form what Unicode calls a 'Higher level protocol' and
> those may legitimately override the default byte order.

OK, one more myth is falling (UCS-2 being mandatorily BE). But I am still
confused with some details.

If UCS-2LE is a *standard* encoding (and it is in fact mentioned in UTR-17),
how does VFAT directories qualify as a "higher level protocol"?

My understanding of "higher level protocol" is that it is a *non* standard
usage of some kind, allowed internally (or within a private group), that
should not be transmitted to the world at large.

Does this mean that MS's VFAT directories miss something (e.g. a BOM) to be
a true UCS-2LE?

Or are you simply meaning that the internals of an operating system are a
"higher level protocol" by definition (even if they casually comply with
some standard)?

_ Marco



Re: Unicode in VFAT file system

2000-07-21 Thread Peter_Constable


On 07/21/2000 04:42:05 AM <[EMAIL PROTECTED]> wrote:

>Unicode is the code, which is based on 16 bit chunks of ether or whatever, and
>UTF-8 is a biased transformation format...

That's too simple to capture the current reality, as others have been
indicating. The full story is available in UTR17, and *everybody* on this
list ought to read and digest it - of all the UTRs, it's probably the one
that's most useful to be read by the broadest audience.

http://www.unicode.org/unicode/reports/tr17/

In a nutshell, Unicode started life being 16-bit monowidth, but the need to
extend and merge with ISO 10646 made life more complicated. At this point,
there is no real option but to say that Unicode is a 21 (or 20.1) bit*
character set combined with various encoding forms and schemes based on 8,
16 or 32 bit data types.

* The codespace for the encoded character set takes a little explanation.
The simplification is that it's 0 - 10FFFF (which takes 21 bits to
represent but doesn't go as far as 21 bits would allow - that would be
1FFFFF). Actually, you have to remove from this D800 - DFFF and 34 values
that match the pattern xxFFFE and xxFFFF.
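
[Illustration added to the archive.] The arithmetic in the footnote can be checked with a short Python sketch. It assumes the codespace runs from 0 to 10FFFF and counts only the exclusions described above: the surrogate range D800..DFFF and the values whose low 16 bits are FFFE or FFFF. The variable names are made up.

```python
CODESPACE_MAX = 0x10FFFF

# Number of bits needed to represent the largest code point.
bits_needed = CODESPACE_MAX.bit_length()

# Values reserved for surrogates.
surrogates = 0xDFFF - 0xD800 + 1

# Values whose low 16 bits are FFFE or FFFF, two per plane.
noncharacters = [cp for cp in range(CODESPACE_MAX + 1)
                 if cp & 0xFFFE == 0xFFFE]

# What remains after both exclusions.
usable = (CODESPACE_MAX + 1) - surrogates - len(noncharacters)
```

The "20.1 bit" remark follows from the same numbers: 10FFFF needs 21 bits, yet it is far below 1FFFFF, the largest value 21 bits could hold.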


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: Unicode in VFAT file system

2000-07-21 Thread Peter_Constable


>In the meantime, Microsoft is still pretty firmly rooted in the idea that
>Unicode=UCS-2 (or UTF-16le on Windows 2000).

I don't think we can make a blanket statement about MS being firmly rooted
in UCS-2. They're very big and manage lots of code that a lot of people use
on a regular basis, and different parts of their code are in different
states. This thread started with VFAT, and it has been observed that,
perhaps, VFAT handles UTF-16 perfectly well while the UI that displays VFAT
strings still only understands UCS-2. Considering some other parts of their
code: Apple updated the TrueType spec to handle non-BMPs and I understand
that MS has already implemented support for that. So, somewhere their code
is handling UTF-16. Chris Pratley has indicated on this list or in his last
Unicode presentation that the next version of Office will handle non-BMPs,
so it will soon be doing UTF-16. But then, many of the controls that come
with their development tools still only support "ANSI" text (e.g. in VB
you store Unicode strings, but don't expect to display anything not
supported by a codepage in a list box).


- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





RE: Unicode in VFAT file system

2000-07-21 Thread Jonathan Rosenne

Unicode is the code, which is based on 16 bit chunks of ether or whatever,
and UTF-8 is a biased transformation format designed to save American and
Western Europeans storage space and to give some people a warm feeling by
keeping Unicode in the familiar 8 bit world.

Jony


> -Original Message-
> From: Michael (michka) Kaplan [mailto:[EMAIL PROTECTED]]
> Sent: Friday, July 21, 2000 9:44 AM
> To: Unicode List
> Subject: Re: Unicode in VFAT file system
>
>
> Although there is some truth here the fact is that it is not really true
> today that everyone equates the two. The default thought on people's minds
> these days when they think of Unicode is UTF-8, it seems like. And this is
> mainly due to applications of Unicode to the web, I think.
>
> In the meantime, Microsoft is still pretty firmly rooted in the idea that
> Unicode=UCS-2 (or UTF-16le on Windows 2000). UTF-8 is named UTF-8 and
> considered to be a multibyte encoding.
>
> michka
>
>
> - Original Message -
> From: "Doug Ewell" <[EMAIL PROTECTED]>
> To: "Unicode List" <[EMAIL PROTECTED]>
> Sent: Thursday, July 20, 2000 10:41 PM
> Subject: Re: Unicode in VFAT file system
>
>
> > Addison Phillips <[EMAIL PROTECTED]> wrote:
> >
> > > Avoiding for the moment the word-parsing that Markus suggests, Unicode
> > > on Microsoft platforms has always been LE (at least on Intel) and they
> > > have called the encoding they use "UCS-2" (when they bothered with
> > > such things: in the past they always called it "Unicode" as if it were
> > > the *only* encoding). As Unicode has evolved, Microsoft products have
> > > become more exact in this regard.
> >
> > I remember that in the early to mid '90s, before the invention (or at
> > least widespread use) of UTF-8, UTF-32, and surrogates, *everybody* --
> > not just Microsoft -- used the term "Unicode" to refer to what we would
> > now call UCS-2.  Even the Unicode Consortium did this!  And even now,
> > the few of my co-workers who know about Unicode (I'm trying to spread
> > the word, folks, honest) think a "Unicode text file" is UCS-2 by
> > definition.  I don't know what they would think of a UTF-8 file --
> > nobody but me is knowingly using them yet.  In any case, this usage is
> > by no means confined to Microsoft.
> >
> > -Doug Ewell
> >  Fullerton, California
> >
>




Re: Unicode in VFAT file system

2000-07-20 Thread Michael \(michka\) Kaplan

Although there is some truth here the fact is that it is not really true
today that everyone equates the two. The default thought on people's minds
these days when they think of Unicode is UTF-8, it seems like. And this is
mainly due to applications of Unicode to the web, I think.

In the meantime, Microsoft is still pretty firmly rooted in the idea that
Unicode=UCS-2 (or UTF-16le on Windows 2000). UTF-8 is named UTF-8 and
considered to be a multibyte encoding.

michka


----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode List" <[EMAIL PROTECTED]>
Sent: Thursday, July 20, 2000 10:41 PM
Subject: Re: Unicode in VFAT file system


> Addison Phillips <[EMAIL PROTECTED]> wrote:
>
> > Avoiding for the moment the word-parsing that Markus suggests, Unicode
> > on Microsoft platforms has always been LE (at least on Intel) and they
> > have called the encoding they use "UCS-2" (when they bothered with
> > such things: in the past they always called it "Unicode" as if it were
> > the *only* encoding). As Unicode has evolved, Microsoft products have
> > become more exact in this regard.
>
> I remember that in the early to mid '90s, before the invention (or at
> least widespread use) of UTF-8, UTF-32, and surrogates, *everybody* --
> not just Microsoft -- used the term "Unicode" to refer to what we would
> now call UCS-2.  Even the Unicode Consortium did this!  And even now,
> the few of my co-workers who know about Unicode (I'm trying to spread
> the word, folks, honest) think a "Unicode text file" is UCS-2 by
> definition.  I don't know what they would think of a UTF-8 file --
> nobody but me is knowingly using them yet.  In any case, this usage is
> by no means confined to Microsoft.
>
> -Doug Ewell
>  Fullerton, California
>




Re: Unicode in VFAT file system

2000-07-20 Thread Doug Ewell

Addison Phillips <[EMAIL PROTECTED]> wrote:

> Avoiding for the moment the word-parsing that Markus suggests, Unicode
> on Microsoft platforms has always been LE (at least on Intel) and they
> have called the encoding they use "UCS-2" (when they bothered with
> such things: in the past they always called it "Unicode" as if it were
> the *only* encoding). As Unicode has evolved, Microsoft products have
> become more exact in this regard.

I remember that in the early to mid '90s, before the invention (or at
least widespread use) of UTF-8, UTF-32, and surrogates, *everybody* --
not just Microsoft -- used the term "Unicode" to refer to what we would
now call UCS-2.  Even the Unicode Consortium did this!  And even now,
the few of my co-workers who know about Unicode (I'm trying to spread
the word, folks, honest) think a "Unicode text file" is UCS-2 by
definition.  I don't know what they would think of a UTF-8 file --
nobody but me is knowingly using them yet.  In any case, this usage is
by no means confined to Microsoft.

-Doug Ewell
 Fullerton, California



Re: Unicode in VFAT file system

2000-07-20 Thread addison

Well...

There has always been a BOM in Unicode and it's there for a reason: to
indicate the byte order on different processors. There is an inherent BE
bias in Unicode. But this doesn't invalidate an LE view of the Universe.

Avoiding for the moment the word-parsing that Markus suggests, Unicode on
Microsoft platforms has always been LE (at least on Intel) and they have
called the encoding they use "UCS-2" (when they bothered with such
things: in the past they always called it "Unicode" as if it were the
*only* encoding). As Unicode has evolved, Microsoft products have become
more exact in this regard.

You'll never *hear* about "UCS-2LE"... since I just invented it to
describe what's going on. In non-standard usage, UCS-2 means "doesn't
support surrogates" and LE happens to be what PCs are using. Calling
something UTF-16 that doesn't support surrogates is bad, as far as I'm
concerned.

VFAT may support surrogates (AFAIK it does). Microsoft is usually quite
good at indicating UTF-16 support in their documentation.

> 
> >3. Filenames are, by definition in Windows-land, UPPERCASE in Western
> >European systems.
> 
> My understanding is that with DOS they were always upper-cased, but 
> probably only for the Western European code pages. With VFAT, the 
> file names are stored as-is, but checked for uniqueness using 
> case-insensitivity (but only in the basic Latin and Latin-1 
> supplement range).

Sure, but they haven't abandoned this behavior in more modern operating
systems. The upper-casing is done to support DOS compatibility, which is
important in a Microsoft networking environment.

thanks,

Addison




Re: Unicode in VFAT file system

2000-07-20 Thread Asmus Freytag

At 11:41 AM 7/20/00 -0800, Ken Krugler wrote:
>No. UCS-2 and UCS-4 have always been bigendian. Read ISO 10646-1:1993,
>section "6.3 Octet order" (page 7):
>
>   When serialized as octets, a more significant octet shall
>   precede less significant octets.

The section continues: "When not serialized as octets the order of octets 
may be specified by an agreement between sender and recipient (see clause 
17.1 and Annex F)"

Annex F introduces the BOM.

On the face of it the two parts of clause 6.3 seem to be a bit 
self-contradictory and could possibly stand some editorial clarification, 
but on the whole, even ISO/IEC 10646 recognizes that other byte orders 
exist and suggests means (in Annex F) how sender and recipient might 
communicate this fact.
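
To make the octet-order point concrete, here is a minimal Python sketch. It
assumes nothing about ISO/IEC 10646 beyond the sentence quoted above; the
sample code unit U+3042 and the variable names are chosen only for
illustration. It serializes one 16-bit code unit both ways with the standard
`struct` module and shows what happens when the reader guesses the wrong order.

```python
import struct

code_unit = 0x3042  # HIRAGANA LETTER A, an arbitrary example value

# ">H" = big-endian unsigned 16-bit: more significant octet first.
big = struct.pack(">H", code_unit)
# "<H" = little-endian unsigned 16-bit: less significant octet first.
little = struct.pack("<H", code_unit)

print(big.hex(), little.hex())

# Reading little-endian octets as big-endian yields a different code unit,
# which is why sender and recipient must agree on the order (or use a BOM).
misread = struct.unpack(">H", little)[0]
print(hex(misread))
```

Both byte strings carry the same code unit; only the agreement between writer
and reader differs.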

Since the time of writing for this clause (1991), both the amount of data 
in the various byte orders and practical experience with Unicode have 
increased dramatically. The full discussion is available in

http://www.unicode.org/unicode/reports/tr17 Character Encoding Model

as well as the relevant sections of The Unicode Standard, Version 3.0


Note that there is no such thing as UCS-2LE or UCS-2BE. These terms are not 
defined anywhere, but UTF-16LE and UTF-16BE are. Unicode has adopted the 
philosophy that indicating subsets (e.g. whether surrogate-accessible 
characters are supported or not) is not something that belongs in the 
designation of the encoding form.

A./



Re: Unicode in VFAT file system

2000-07-20 Thread Asmus Freytag

At 11:34 AM 7/20/00 -0800, John Cowan wrote:
> > 1. Could it be using UTF-16LE? I tried creating an entry with a
> > surrogate pair, but the name was displayed with two black boxes on a
> > Windows 2000-based computer, so I assumed that surrogates were not
> > supported.
>
>Probably not.  So technically it *is* UCS-2 (LE) rather than UTF-16LE.

If the data has a BOM, it's UTF-16; if it doesn't, but is known to be 
little-endian, it's UTF-16LE.
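
The rule above can be sketched in a few lines of Python. This is only an
illustration, not anything VFAT or Windows actually runs: the function name
`decode_utf16` and its `default` parameter are made up here, and the fallback
order for BOM-less data is an assumption the caller has to supply.

```python
import codecs

def decode_utf16(data, default="utf-16-be"):
    """Decode UTF-16 bytes, honoring a leading BOM if one is present."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: the byte order has to come from outside the data itself,
    # e.g. from knowing that the file system always stores little-endian.
    return data.decode(default)

with_bom = codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")
without_bom = "abc".encode("utf-16-le")

print(decode_utf16(with_bom))
print(decode_utf16(without_bom, default="utf-16-le"))
```

For a file name stored without a BOM, the caller passes the byte order it
already knows, which is the "known to be little endian" case.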

Seeing two boxes is not conclusive. If the file system happily stores these 
at the original code points, then *it* would support UTF-16, even if the 
shell, or the text output API underneath it, that was used to display the 
name does not.

Only by looking at the string with a debugger would you know for sure that 
VFAT indeed kept the string intact and didn't mangle it.

One other possible test would be to try to present the file system with a 
string that's longer than the maximum number of characters, where the last 
character is an unpaired surrogate - it would be interesting to see what it 
does.
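
As a rough sketch of that kind of check, the Python below takes raw name bytes
(however you obtained them, for example from a disk editor dump), splits them
into little-endian 16-bit code units, and reports whether every surrogate is
properly paired. The helper names are hypothetical and the sample bytes are
generated in the script itself, not read from a real VFAT volume.

```python
import struct

def utf16le_code_units(data):
    """Split raw bytes into little-endian 16-bit code units."""
    count = len(data) // 2
    return list(struct.unpack("<%dH" % count, data[:count * 2]))

def surrogates_intact(units):
    """Return True if every high surrogate is followed by a low surrogate
    and no low surrogate appears on its own."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            return False
        else:
            i += 1
    return True

# Stand-in for bytes read back from the directory entry.
name = "a\U00010000b".encode("utf-16-le")
units = utf16le_code_units(name)
print([hex(u) for u in units], surrogates_intact(units))
```

If the code units read back match what was written, the storage layer kept the
pair, whatever the shell chooses to draw.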

A./



Re: Unicode in VFAT file system

2000-07-20 Thread Ken Krugler

Hi Addison,

>UCS-2 is pretty close to the same thing as UTF-16. The differences do not
>apply here.
>
>UCS-2 can be big-endian or little-endian. The rule is that BE is the
>default. However, on Intel platforms, you shouldn't be surprised to see LE
>everywhere: that's the architecture. Microsoft is saving two bytes for
>every filename by not storing a BOM.

Thanks for the fast response. I was basing my understanding of UCS-2 
always being big-endian on Markus Kuhn's prior email, which said:

At 2:58am -0800 00-02-18, Markus Kuhn wrote:
>Date: Fri, 18 Feb 2000 02:58:51 -0800 (PST)
>From: Markus Kuhn <[EMAIL PROTECTED]>
>Subject: Re: UCS-4, UCS-2, UTF-16, UTF-8
>To: Unicode List <[EMAIL PROTECTED]>
>X-UML-Sequence: 12380 (2000-02-18 10:58:53 GMT)
>
>Yung-Fong Tang wrote on 2000-02-17 21:18 UTC:
>  > UCS-4 does not specify byte order, but UTF-32BE and
>  > UTF-32LE does.
>
>No. UCS-2 and UCS-4 have always been bigendian. Read ISO 10646-1:1993,
>section "6.3 Octet order" (page 7):
>
>   When serialized as octets, a more significant octet shall
>   precede less significant octets.
>
>ISO and ITU have fortunately always frowned upon Intel's horrible 1970s
>decision of staying compatible with some obscure long-forgotten 1960s
>mainframe for which they had bought some software when they made the
>8080 a littleendian processor (Intel's microcontrollers by the way are
>all bigendian, as is pretty much anything else that was not designed to
>be Intel compatible).

So now I'm a bit confused, since I've never heard of UCS-2LE/UCS-2BE.

>You should note that Microsoft *means* UCS-2LE (and UTF-16LE in more
>modern systems) when they say "Unicode" (at least on Intel platforms).
>
>So:
>
>1. Yes, it is perfectly valid.
>2. There are no characters in the surrogate space just yet, so a black
>square should be no surprise. Two black squares means that it's being
>treated as UCS-2.

Does anybody know if Microsoft has publicly stated if/when they'll 
support surrogates in VFAT file names?

>3. Filenames are, by definition in Windows-land, UPPERCASE in Western
>European systems.

My understanding is that with DOS they were always upper-cased, but 
probably only for the Western European code pages. With VFAT, the 
file names are stored as-is, but checked for uniqueness using 
case-insensitivity (but only in the basic Latin and Latin-1 
supplement range).
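
A small Python model of the behavior described above, for illustration only.
It is an assumption about how such a comparison could work, not the actual
VFAT algorithm: `latin1_fold` and `names_collide` are made-up names, and the
rule of upper-casing only when the result stays within U+0000 to U+00FF is a
guess at a reasonable boundary treatment.

```python
def latin1_fold(name):
    """Upper-case only characters in U+0000..U+00FF whose upper-case form
    is a single character that also lies in that range."""
    out = []
    for ch in name:
        if ord(ch) <= 0xFF:
            up = ch.upper()
            if len(up) == 1 and ord(up) <= 0xFF:
                ch = up
        out.append(ch)
    return "".join(out)

def names_collide(a, b):
    """Model of a uniqueness check: names are stored as-is but compared
    after Latin-1-only case folding."""
    return latin1_fold(a) == latin1_fold(b)

print(names_collide("readme.txt", "README.TXT"))
print(names_collide("caf\u00e9", "CAF\u00c9"))
# Full-width Latin letters (U+FF41 vs U+FF21) are outside the range.
print(names_collide("\uff41", "\uff21"))
```

Under this model the first two pairs clash and the full-width pair does not.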

>Other scripts either don't have the concept of case or
>weren't mucked with. This includes compatibility characters stored outside
>the U+0000 to U+00FF range.

OK - this matches the behavior I was seeing with Japanese Windows 
systems, where full-width Romaji isn't case-folded before checking 
file names.

Thanks,

-- Ken

>===
>Addison P. Phillips    Principal Consultant
>Inter-Locale LLC    http://www.inter-locale.com
>Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]
>
>+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
>===
>Globalization Engineering & Consulting Services
>
>On Thu, 20 Jul 2000, Ken Krugler wrote:
>
>  > Hi Unicoders,
>  >
>  > Recently I've had the dubious pleasure of delving into the details of
>  > the VFAT file system. For long file names, I thought it used UCS-2,
>  > but in looking at the data with a disk editor, it appears to be
>  > byte-swapping (little endian). I thought that UCS-2 was by definition
>  > big endian, thus I've got the following questions:
>  >
>  > 1. Could it be using UTF-16LE? I tried creating an entry with a
>  > surrogate pair, but the name was displayed with two black boxes on a
>  > Windows 2000-based computer, so I assumed that surrogates were not
>  > supported.
>  >
>  > 2. Is little-endian UCS-2 a valid encoding that I just don't know about?
>  >
>  > 3. And finally, why are file names case-insensitive for characters in
>  > the U+0000 to U+00FF range, but not for any other characters? OK,
>  > maybe I can guess at the answer to that one...
>  >
>  > Thanks,
>  >
>  > -- Ken
>  > Ken Krugler
>  > TransPac Software, Inc.
>  > 
>  > +1 530-470-9200
>  >

Ken Krugler
TransPac Software, Inc.

+1 530-470-9200



Re: Unicode in VFAT file system

2000-07-20 Thread John Cowan

Ken Krugler wrote:

> I thought that UCS-2 was by definition big endian

It's big-endian by *default*.  If you have a BOM, you can determine the
polarity directly, but putting a BOM in every file name would be silly.
Windows file systems will only be used on LE machines, so storing everything
as LE is sensible (and is what Unicode calls a "higher-level protocol").

> 1. Could it be using UTF-16LE? I tried creating an entry with a
> surrogate pair, but the name was displayed with two black boxes on a
> Windows 2000-based computer, so I assumed that surrogates were not
> supported.

Probably not.  So technically it *is* UCS-2 (LE) rather than UTF-16LE.

> 3. And finally, why are file names case-insensitive for characters in
> the U+0000 to U+00FF range, but not for any other characters? OK,
> maybe I can guess at the answer to that one...

Case insensitivity is a backwards-compatibility hack, basically.

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <[EMAIL PROTECTED]>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,   || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)



Re: Unicode in VFAT file system

2000-07-20 Thread Asmus Freytag

At 09:53 AM 7/20/00 -0800, Ken Krugler wrote:
>2. Is little-endian UCS-2 a valid encoding that I just don't know about?

Yes, it is. Your example of the VFAT system is a near perfect case, since
the details of it form what Unicode calls a 'Higher level protocol' and
those may legitimately override the default byte order.

A./




RE: Unicode in VFAT file system

2000-07-20 Thread Yves Arrouye

> Recently I've had the dubious pleasure of delving into the details of 
> the VFAT file system. For long file names, I thought it used UCS-2, 
> but in looking at the data with a disk editor, it appears to be 
> byte-swapping (little endian). I thought that UCS-2 was by definition 
> big endian, thus I've got the following questions:
> 
> 1. Could it be using UTF-16LE? I tried creating an entry with a 
> surrogate pair, but the name was displayed with two black boxes on a 
> Windows 2000-based computer, so I assumed that surrogates were not 
> supported.

It is UTF-16 (LE, because of Intel architecture), and AFAIK there is no
surrogate support yet. Not that there would be anything to display, except
one box instead of two :)
 
YA



Re: Unicode in VFAT file system

2000-07-20 Thread addison

Hi Ken,

UCS-2 is pretty close to the same thing as UTF-16. The differences do not
apply here.

UCS-2 can be big-endian or little-endian. The rule is that BE is the
default. However, on Intel platforms, you shouldn't be surprised to see LE
everywhere: that's the architecture. Microsoft is saving two bytes for
every filename by not storing a BOM.

You should note that Microsoft *means* UCS-2LE (and UTF-16LE in more
modern systems) when they say "Unicode" (at least on Intel platforms).

So:

1. Yes, it is perfectly valid.
2. There are no characters in the surrogate space just yet, so a black
square should be no surprise. Two black squares means that it's being
treated as UCS-2.
3. Filenames are, by definition in Windows-land, UPPERCASE in Western
European systems. Other scripts either don't have the concept of case or
weren't mucked with. This includes compatibility characters stored outside
the U+0000 to U+00FF range.
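
On point 2, a short Python sketch of why a display layer that treats the data
as UCS-2 would draw two boxes: it sees two 16-bit code units where UTF-16
defines one character. The function name is made up for this example, and
U+10000 is used only as a sample supplementary code point.

```python
def utf16_code_unit_count(text):
    """Number of 16-bit code units needed to store text in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

supplementary = "\U00010000"  # one code point outside the 16-bit range
bmp = "\u3042"                # one code point inside the 16-bit range

# A UCS-2 view counts code units; a UTF-16 view counts code points.
print(len(supplementary), utf16_code_unit_count(supplementary))
print(len(bmp), utf16_code_unit_count(bmp))
```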

Regards,

Addison

===
Addison P. Phillips    Principal Consultant
Inter-Locale LLC    http://www.inter-locale.com
Los Gatos, CA, USA  mailto:[EMAIL PROTECTED]

+1 408.210.3569 (mobile)  +1 408.904.4762 (fax)
===
Globalization Engineering & Consulting Services

On Thu, 20 Jul 2000, Ken Krugler wrote:

> Hi Unicoders,
> 
> Recently I've had the dubious pleasure of delving into the details of 
> the VFAT file system. For long file names, I thought it used UCS-2, 
> but in looking at the data with a disk editor, it appears to be 
> byte-swapping (little endian). I thought that UCS-2 was by definition 
> big endian, thus I've got the following questions:
> 
> 1. Could it be using UTF-16LE? I tried creating an entry with a 
> surrogate pair, but the name was displayed with two black boxes on a 
> Windows 2000-based computer, so I assumed that surrogates were not 
> supported.
> 
> 2. Is little-endian UCS-2 a valid encoding that I just don't know about?
> 
> 3. And finally, why are file names case-insensitive for characters in 
> the U+0000 to U+00FF range, but not for any other characters? OK, 
> maybe I can guess at the answer to that one...
> 
> Thanks,
> 
> -- Ken
> Ken Krugler
> TransPac Software, Inc.
> 
> +1 530-470-9200
> 




Unicode in VFAT file system

2000-07-20 Thread Ken Krugler

Hi Unicoders,

Recently I've had the dubious pleasure of delving into the details of 
the VFAT file system. For long file names, I thought it used UCS-2, 
but in looking at the data with a disk editor, it appears to be 
byte-swapping (little endian). I thought that UCS-2 was by definition 
big endian, thus I've got the following questions:

1. Could it be using UTF-16LE? I tried creating an entry with a 
surrogate pair, but the name was displayed with two black boxes on a 
Windows 2000-based computer, so I assumed that surrogates were not 
supported.

2. Is little-endian UCS-2 a valid encoding that I just don't know about?

3. And finally, why are file names case-insensitive for characters in 
the U+0000 to U+00FF range, but not for any other characters? OK, 
maybe I can guess at the answer to that one...

Thanks,

-- Ken
Ken Krugler
TransPac Software, Inc.

+1 530-470-9200