RE: Roundtripping in Unicode

2004-12-16 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 Yes, IMHO all general-purpose languages should support processing
 arrays of bytes, in addition to Unicode strings.


C is likely to retain the behavior of the str functions. That, however, puts a lot of burden on the developers to identify all opaque strings and really handle them with those functions throughout the application (or, even worse, throughout a suite of applications not necessarily written by the same company).

Newer languages are probably often designed with an assumption that all you need is a good class for Unicode strings. Instead of making them change that assumption, we could consider finding a way to make that true.

If a solution that doesn't break anything in Unicode cannot be found, then consider a solution that does break something, but check what the broken part really affects. For example, we assume it MUST be possible to represent a valid Unicode string in any UTF stream and get it back. Suppose you find a solution that retains that capability for all Unicode codepoints except for 128 of them. If you know that those 128 will ONLY be used for a particular purpose, you might be willing to accept that those who use those codepoints will deal with the problem, while for those who don't, the rules didn't really change. What I am saying is that we need to preserve the intention of the existing rules, not the rules themselves.

But again, this is if I was proposing that everybody starts using my conversion everywhere. Which at this point I am not.

 
 It's not clear however how the API of filenames should look like,
 especially if they wish to be portable to Windows.


I intend to bring up the issue in the near future, and will try to let everyone catch some breath before that.



 or delimit the filename with \0, or prefix it with
 the length, or something like this.


I don't see why that would be necessary or useful.



 A backup software should do this
 and not pay attention to the locale. But for end-user software like
 an image viewer, processing arbitrary filenames is less important.


You have to pay attention to the locale eventually. You need to report which file failed to be backed up (or is infected with a virus). And you should be able to let the user restore a single file. If you don't interpret it according to the locale (possibly UTF-8), the user won't know how to select what she wants. It is even worse if one wants to enter the filename manually. All this CAN be done within the application, but it is very cumbersome. It gets worse if you want to pass some information to other software, since the other application may not have an interface to accept the opaque strings. If it does, the convention may differ. This is why I am saying that something should be standardized. Of course, standardizing a poor solution is not a good idea. We should do our best to find a good one.


 Technically they are binary (command line arguments must not contain
 zero bytes). Users are expecting stdin and stdout to be treated as
 text or binary depending on the program, while command line arguments
 are generally interpreted as text or filenames.


So, an application outputting filenames has a binary stdout, and no text application is guaranteed to process this output.


Lars





RE: Roundtripping in Unicode

2004-12-15 Thread D. Starner
Arcane Jill writes:

 The obvious solution is for all Unix machines everywhere to be using the same 
 locale - and it 
 had better be UTF-8. But an instantaneous global switch-over is never going 
 to happen, so we see 
 this gradual switch-over ... and it is during this transition phase that 
 Lars's problem 
 manifests. 

The only solution is (a) to use ASCII or (b) to make the switch over as quick 
and clean as possible. Anyone who wants to create new files in UTF-8 and leave 
their old files in the old encoding is asking for trouble. There's no magic
bullet, and complaining here as much as you want won't help. If you're a
system administrator, explain that to the people using your system, and
treat stupid responses just like you would any LART-worthy response.






RE: Roundtripping in Unicode

2004-12-15 Thread Arcane Jill
-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Philippe Verdy
Sent: 14 December 2004 22:47
To: Marcin 'Qrczak' Kowalczyk
Cc: [EMAIL PROTECTED]
Subject: Re: Roundtripping in Unicode
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Arcane Jill [EMAIL PROTECTED] writes:
If so, Marcin, what exactly is the error, and whose fault is it?
It's an error to use locales with different encodings on the same
system.
I confess I don't know much about Unix, but still, I'm not sure your 
assertion (Marcin) makes sense. Unix is a multi-user system. If you log on 
as User A, then User B's settings are hidden from you, unless User B has 
explicitly decided to share them. It may even be possible that there may be 
users of whose existence you are not even aware. Unix makes it possible for 
/you/ to change /your/ locale - but by your reasoning, this is an error, 
unless all other users do so simultaneously. Your reasoning implies that no 
Unix user should ever change their locale unless they have an absolute 
guarantee that all other users are going to do so simultaneously ... but I 
don't know if you can ever get such a guarantee. Or maybe you're saying that 
the error lies with Unix itself. Maybe that's a fair comment, but I gather 
Unix was invented before Unicode, so it can hardly be blamed for breaking 
Unicode's conceptual model.

But it goes beyond that. Copy a file onto a floppy disc and then physically 
take that floppy disc to a different Unix machine and log on as guest and 
insert the disc ... Will the filename look the same? It would seem that the "same system" is effectively every Unix machine on the planet, since files may be interchanged between them.

The obvious solution is for all Unix machines everywhere to be using the 
same locale - and it had better be UTF-8. But an instantaneous global 
switch-over is never going to happen, so we see this gradual switch-over ... 
and it is during this transition phase that Lars's problem manifests.

Philippe adds...
More simply, I think that it's an error to have the encoding part of any
locale...
which again attaches blame to Unix itself. All very "not my problem", but I think Lars has found that it actually /is/ his problem. (Not that I support his solution).

The system should not depend on them, and for critical things like
filesystem volumes, the encoding should be forced by the filesystem itself,
and applications should mandatorily follow the filesystem rules.
Of course, you are not /really/ suggesting that the Unix kernel be rewritten. But it's hard for me to see how else this could be achieved.

Now think about the web itself: it's really a filesystem, with billions
users, or trillion applications using simultaneously hundreds or thousands
of incompatible encodings... Many resources on the web seem to have valid
URLs for some users but not for others, until URLs are made independent of any user locale, and then not considered as encoded plain-text but only as strings of bytes.
Oh yeah - and that too. Well spotted.
Jill



RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk replied:
 Arcane Jill [EMAIL PROTECTED] writes:
 
  If so, Marcin, what exactly is the error, and whose fault is it?
 
 It's an error to use locales with different encodings on the same
 system.


Um, and whose fault is it?


You can advise the users against it, but they won't necessarily listen.


Switching to UTF-8 on UNIX opens two possibilities:


1 - Users that HAD different encodings on the same system will now only have one, namely UTF-8.
2 - Users that didn't have different encodings now may end up with different (and quite incompatible) encodings.


Assuming everything will happen quickly, and on all systems, is ... well, ignorant.


Once it happens, offending filenames should be rare. One could creep in for various reasons, not limited to malicious attempts.

Automated or assisted upgrades to UTF-8 have been mentioned. For those that will be able to use them, great. I would even go a step further: I would incorporate a switch into UNIX filesystems that would enable a validator. This validator would reject invalid UTF-8 filenames (along with some other characters), preventing them from being created in the first place. This is quite un-UNIX-like, but then so is UTF-8. Perhaps then we can declare UNIX filenames to be text. Well, for the most part. Except for some applications that WILL still need to be able to access all files, even on systems whose users decide (perhaps for valid reasons!) not to enable that validator.
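
A minimal sketch of what such a validator might check, written in C. The function name and the exact policy (here also rejecting ASCII control characters) are illustrative assumptions on my part; nothing below is mandated by the Unicode standard or by POSIX, and a real filesystem hook would live in the kernel with a configurable policy.

#include <stddef.h>

/* Illustrative sketch only: returns 1 if the byte string is well-formed
 * UTF-8 and contains no ASCII control characters, 0 otherwise.  Overlong
 * forms, surrogates and values above U+10FFFF are rejected. */
static int utf8_filename_ok(const unsigned char *s)
{
    while (*s) {
        unsigned c = *s;
        size_t len;
        unsigned long cp, min;
        if (c < 0x20 || c == 0x7F)
            return 0;                        /* control characters */
        if (c < 0x80) { s++; continue; }     /* plain ASCII */
        if      ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return 0;                       /* stray continuation or bad lead byte */
        for (size_t i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80)
                return 0;                    /* truncated or bad continuation byte */
            cp = (cp << 6) | (s[i] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                        /* overlong, out of range, surrogate */
        s += len;
    }
    return 1;
}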


Lars





RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Oops, correction:


In response to Marcin 'Qrczak' Kowalczyk
 Question: should a new programming language which uses Unicode for 
 string representation allow non-characters in strings? Argument for 
 allowing them: otherwise they are completely useless, except 
 U+FFFE for BOM detection. Argument for disallowing them: they make 
 UTF-n inappropriate for serialization of arbitrary strings, and thus 
 non-standard extensions of UTF-n must be used for serialization. 


I wrote:


My opinion: 
 It should allow them and process them usefully. Furthermore, this 
 'usefully' should not be up to developers to discover. It should be 
 researched, described, well, in the end even standardized. IMHO, UTC 
 should consider leading this process, even if it does not end with 
 anything standardized in the Unicode standard.

 Validation should be completely separated from processing. IMHO. 


I wasn't paying attention to what Marcin wrote, namely the term non-characters.
What I wrote goes for invalid sequences and surrogates.


Lars





RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





D. Starner wrote:
 The only solution is (a) to use ASCII or (b) to make the 
 switch over as quick 
 and clean as possible. Anyone who wants to create new files 
 in UTF-8 and leave 
 their old files in the old encoding is asking for trouble. 
 There's no magic
 bullet, and complaining here as much as you want won't help. 
 If you're a
 system administrator, explain that to the people using your 
 system, and
 treat stupid responses just like you would any LART-worthy response.


A lone IT guy in a small company is not really in a position to take that stance. His user is also his boss. And it gets more complicated when thousands of systems in a network are involved. And what if the guys in the IT department realize the risks, and know they will be blamed for any inconvenience? Perhaps they will decide that the switch to UTF-8 is not really needed in their company. Though some users will start using UTF-8 on their own. And come complaining about the problems. And IT will again try to balance what to do. Except now it's even worse, since not all filenames are in Latin 1. And so on.

Lars





RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Kenneth Whistler wrote:
 Lars said:
 
  According to UTC, you need to keep processing
  the UNIX filenames as BINARY data. And, also according to 
 UTC, any UTF-8
  function is allowed to reject invalid sequences. Basically, 
 you are not
  supposed to use strcpy to process filenames.
 
 This is a very misleading set of statements.
Perhaps deliberately so.


 
 First of all, the UTC has not taken *any* position on the
 processing of UNIX filenames.
At this point, I won't make any statement about whether UTC should or need not do that.
Let me just ask if it is appropriate to discuss such issues on this list?


 
 It is erroneous to imply that the UTC has indicated that you
 are not supposed to use strcpy to process filenames.
As long as explanations about validation aren't misinterpreted by some people. Is there a thorough explanation of where and how to apply validation anywhere in the standard?

 
 Any process *interpreting* a UTF-8 code unit sequence as
 characters can and should recognize invalid sequences, but
 that is a different matter.
OK, strcpy does not need to interpret UTF-8. But strchr probably should. Or, is it that strchr is for opaque strings and mbschr is for UTF-8 strings? Then strchr should remain as is and be used for processing filenames. Hopefully, you do not need to search for Unicode characters in it and strchr-ing for '/' is all you need. But then all languages are supposed to provide functions for processing opaque strings in addition to their Unicode functions. Or, alternatively, they need to carefully define how string functions should process invalid sequences. If that can be done at all.
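
As an aside, here is a small C sketch of why strchr-ing for '/' stays safe on opaque byte strings: in UTF-8, every byte of a multibyte sequence is 0x80 or above, so the ASCII byte 0x2F can only ever be the path separator. The filename below is made up and deliberately contains an invalid byte; this is only an illustration, not anything prescribed by a standard.

#include <stdio.h>
#include <string.h>

/* Sketch: '/' (0x2F) never occurs inside a UTF-8 multibyte sequence, because
 * lead and continuation bytes are all >= 0x80.  So searching the raw bytes
 * with strchr finds path separators correctly even when the rest of the
 * name is not valid UTF-8 (the 0xC2 below is a lone, invalid byte). */
int main(void)
{
    const char *path = "b\xC2" "d/file.txt";
    const char *slash = strchr(path, '/');
    if (slash != NULL)
        printf("directory part is %d bytes long\n", (int)(slash - path));
    return 0;
}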

But sooner or later you need to incorporate the filename in some UTF-8 text. An error report, for example. You then need to program the boundaries quite carefully.

Not to mention the cost to maintain existing programs. I think it makes sense to keep looking for other solutions.


 
 If I pass the byte stream 0x80 0xFF 0x80 0xFF 0x80 0xFF to
 a process claiming conformance to UTF-8 and ask it to interpret
 that as Unicode characters, it should tell me that it is
 garbage. *How* it tells me that it is garbage is a matter of
 API design, code design, and application design.


What are stdin, stdout and argv (command line parameters) when a process is running in a UTF-8 locale? Binary? Opaque strings? UTF-8?

 Unicode did not invent the notion of conformance to character
 encoding standards. What is new about Unicode is that it has
 *3* interoperable character encoding forms, not just one, and
 all of them are unusual in some way, because they are designed
 for a very, very large encoded character repertoire, and
 involve multibyte and/or non-byte code unit representations.


The difference is that far more people will be faced with such problems.



Lars





RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 If one application switches from standard UTF-8 to your modification,
 and another application continues to use standard UTF-8, then the
 ability to pass arbitrary Unicode strings between them by serializing
 them to UTF-8 is lost. So you can't claim that does not affect
 programs which don't adopt it. It would have to be adopted by all
 programs which currently use UTF-8, or data exchange would break.


I don't think so. If I produce UTF-8 data from filenames, and give it to an UTF-8 application, nothing can be lost in the portion of this architecture that deals with Unicode data. Now, if you expect that you can give me Unicode data and I should store it in a filesystem (as a filename), then you're in error. It is definitely true that you can create a sequence of valid Unicode characters from my range and I will not be able to give it back. But I will also have to reject any '/' characters you feed me. You are misusing my application.

If some application chooses to use my conversion and loses or misinterprets your data, then it is broken and shouldn't use that conversion, or should not declare that particular interface as a Unicode interface.

 But it's not a viable replacement of UTF-8. Even if both applications
 use your modification, the ability to serialize arbitrary sequences
 of valid code points (i.e. not surrogates) through UTF-8 is lost: the
 mapping to modified UTF-8 is not injective.
Yes, that is true. But there are people who would be willing to accept that since it only happens if those 128 codepoints are used. Those can use the conversion, others needn't.

OK, there is one problem that I *do* see with the use of my conversion. I map a file from UX to Win. You then use not my application, but another one, which copies the file back from Win to UX (and that is easier, so you *can* use this application). Now the invalid sequence is already escaped. If I map this new file to Win again, I need to escape the escape. They can start piling up.

Of course, once you realize the problem, you can simply rename the file: you can undo the over-escaping (no data is ever lost!), and probably rename the file to valid UTF-8, which is what you want anyway. And you can do it even from the Windows system. If you prevent my solution, you will not have my program in the first place, meaning you will need to go to the UNIX system to rename the file, and even just to access it in the first place.

Actually, there are two subflavors of my conversion possible (I can hear you say "oh, no"). One does escape the escapes, the other doesn't. This second flavor can be used by applications that need to make UTF-8 from arbitrary input, but do not need to re-create the original byte sequence. Basically, they are preserving all the data, except for the information about how many times the original invalid sequences were escaped. There may be a need for such applications, and they would in fact reduce the re-escaping problem.


 Which means that UTF-8 can't be replaced with your modification.
 If they coexisted, expect trouble when the two slightly incompatible
 encodings meet.
Or, expect trouble when dealing with data that is not guaranteed to be UTF-8. Or hope that there will be no such data in the near future, and I mean none.


  Using my conversion, Windows can access any file on UNIX, because my
 conversion guarantees roundtrip UX => Win => UX
 
 Well, with or without your conversion it's not true, because there
 are various characters which are valid in Unix filenames but not in
 Windows (e.g. ? * : \ and control characters). So if all filenames are
 to be accessible, they have to introduce some escaping. And as soon
 as an escaping scheme is used, it can be extended to encode isolated
 bytes with high bit set.
Good point. But you are assuming I copy the files to a Windows filesystem. I don't. I have no problems if you specify your filename with any of the above characters, even from Windows.

And, BTW, suppose UTF-8 validation is introduced (as an option) on UNIX filesystems. The characters you mention (and some others; I can tell you exactly which don't work on Windows) could again be (optionally) rejected on UNIX filesystems.

 Win => UX => Win roundtrip is not guaranteed.
 
 Currently it breaks only for isolated surrogates (assuming the Unix
 is configured to use UTF-8). If Windows filenames are specified to be
 UTF-16, the error is clearly on the Windows side and this side should
 be fixed.
And in my case, it would break for some malicious sequences of the 128 codepoints. Equally rare, and with equally minor consequences. Um, and it can be fixed, too. Such malicious sequences could be forbidden in contexts where we fear they might cause problems.


Lars





Re: Roundtripping in Unicode

2004-12-15 Thread Mark Davis
 Nope. No data corruption. You just get the odd bytes back. And achieve

I see more of what you are trying to do; let me try to be more clear.
Suppose that the conversion is defined in the following way, between Unicode
strings (D29a-d, page 74) and UTFs using your proposed new characters, for
now with private use code points U+E080..U+E0FF.

U8 -> UTF-32. To convert a Unicode 8-bit string to UTF-32:
1. Set the pointer to the start
2. If the sequence starting at the pointer is a valid UTF-8 sequence
(checking of course to make sure it doesn't go off the end of the string),
convert it and emit.
3. Otherwise take the byte B following the pointer, and emit [E000 + B].

- Note that because all single bytes 00..7F are valid UTF-8, #3 doesn't get invoked on anything but 80..FF.
4. Advance the pointer past what was used and repeat until done


UTF-32 -> U8. To convert a UTF-32 string to a Unicode 8-bit string:
1. Set the pointer to the start
2. If the code point C at the pointer is from E080 to E0FF, emit a single
byte, [C - E000]
3. Otherwise convert to the UTF-8 sequence and emit.
4. Advance the pointer past what was used and repeat until done
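
In code, the two procedures might look like the following C sketch. decode_utf8 is a helper of my own for step 2, the output buffers are assumed to be large enough, and U+E080..U+E0FF are the private use points named above; this is only an illustration of the algorithm as described, not a reference implementation.

#include <stddef.h>
#include <stdint.h>

/* Helper for step 2: if p starts a valid UTF-8 sequence (correct
 * continuations, no overlongs, no surrogates, at most U+10FFFF), store the
 * code point in *cp and return its length in bytes; otherwise return 0. */
static size_t decode_utf8(const unsigned char *p, size_t n, uint32_t *cp)
{
    uint32_t c = p[0], min;
    size_t len;
    if (c < 0x80) { *cp = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; min = 0x10000; }
    else return 0;
    if (n < len) return 0;                        /* would run off the end */
    for (size_t i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* U8 -> UTF-32: valid UTF-8 is decoded normally; every other byte B
 * (necessarily 0x80..0xFF) is emitted as the escape code point E000 + B. */
size_t u8_to_utf32(const unsigned char *in, size_t n, uint32_t *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t cp;
        size_t len = decode_utf8(in + i, n - i, &cp);
        if (len) { out[o++] = cp;              i += len; }
        else     { out[o++] = 0xE000 + in[i];  i += 1;   }
    }
    return o;
}

/* UTF-32 -> U8: escape code points E080..E0FF become the single byte they
 * stand for; everything else is encoded as ordinary UTF-8. */
size_t utf32_to_u8(const uint32_t *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = in[i];
        if (c >= 0xE080 && c <= 0xE0FF) {
            out[o++] = (unsigned char)(c - 0xE000);
        } else if (c < 0x80) {
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (c >> 18));
            out[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}

Applying u8_to_utf32 followed by utf32_to_u8 returns the original bytes; the reverse composition does not, which is exactly the asymmetry shown in the examples below.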


Taking any byte string, it would roundtrip when applying U8 -> UTF-32 then UTF-32 -> U8. However, the reverse would not be true; UTF-32 strings would not roundtrip through U8. For example,

start with UTF-32: 00A0 E0C2 E0A0
applying UTF-32 -> U8, goes to: C2 A0 C2 A0
applying U8 -> UTF-32, goes to: 00A0 00A0

Of course, a UTF-32 -> UTF-8 transformation would preserve these code points

 00A0 E0C2 E0A0 => C2 A0 EE 83 82 EE 82 A0

so it would behave differently than the UTF-32 -> U8 conversion.


Of course, one could apply this process between the Unicode bit strings and
UTFs of other widths. And the same thing applies; one direction would
roundtrip and the other wouldn't.

start with UTF-8: C2 A0 EE 83 82 EE 82 A0
applying UTF-8 -> U8, goes to: C2 A0 C2 A0
applying U8 -> UTF-8, goes to: C2 A0 C2 A0


(I realize that some of this may duplicate what others have said -- I
haven't had the time to follow this thread in any detail.)

Mark

- Original Message - 
From: Lars Kristan
To: 'Mark Davis' ; Kenneth Whistler
Cc: [EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 03:30
Subject: RE: Roundtripping in Unicode


 Ken is absolutely right. It would be theoretically possible
 to add 128 code
 points that would allow one to roundtrip a bytestream after
 passing through
 a UTF-8 => UTF-32 conversion. (For that matter, it would be
 possible to add
 2048 code points that would allow the same for a 16-bit data stream.)
You don't really need to add anything for 16-bit => UTF-32. There is no
real-life need to have that roundtrip guaranteed. For 8-bit data there is
a real-life need. And even, for 16-bit => UTF-32 you can do it simply by
defining how surrogates should be processed. Not saying it should be done,
but showing it could be done. But for UTF-8 => UTF-32 it cannot be done
without 128 new codepoints. Which is why I am often comparing these 128
codepoints to the surrogates. With one difference: they should be valid
characters.

 However, these new code points would really be no better than
 private use
 code points, since their interpretation would depend entirely
Oh yes they would. Anyone might be using those same codepoints in PUA for
something completely different.
 on whatever
 was assumed to be the interpretation of the original bytestream. If X
 converted a bytestream that was assumed to be a mixture of
 8859-7 with UTF-8
 into Unicode with these new characters, and handed it off to Y, who
 converted the bytestream back assuming that the odd bytes were to be
 iso-8859-9, you would get data corruption. X and Y would have
Nope. No data corruption. You just get the odd bytes back. And achieve
exactly the same as if X passed the data directly to Y. Y doesn't convert
from UTF-8 to iso-8859-9, nor does it convert the odd bytes to iso-8859-9.
It converts UTF-8 to the original byte stream and ONLY THEN interprets it
as iso-8859-9. So, the same as if it got the data directly.


Lars




RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode






Marcin 'Qrczak' Kowalczyk wrote:
 But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
 NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
 awkward way which would happen to exclude those subsequences of
 non-characters which would form a valid UTF-8 fragment.
NOT-UTF-16 -> NOT-UTF-8 -> NOT-UTF-16 was never a goal. Nor was UTF-16 -> NOT-UTF-8 -> UTF-16, or NOT-UTF-16 -> UTF-8 -> NOT-UTF-16.

UTF-16 -> UTF-8 -> UTF-16 is preserved and that keeps the goals of UTF intact.


The goal, BTW, is: NOT-UTF-8 -> UTF-16 -> NOT-UTF-8.


 Question: should a new programming language which uses Unicode for
 string representation allow non-characters in strings? Argument for
 allowing them: otherwise they are completely useless, except
 U+FFFE for BOM detection. Argument for disallowing them: they make
 UTF-n inappropriate for serialization of arbitrary strings, and thus
 non-standard extensions of UTF-n must be used for serialization.
My opinion:
It should allow them and process them usefully. Furthermore, this 'usefully' should not be up to developers to discover. It should be researched, described, well, in the end even standardized. IMHO, UTC should consider leading this process, even if it does not end with anything standardized in the Unicode standard.

Validation should be completely separated from processing. IMHO.



Lars





Re: Roundtripping in Unicode

2004-12-15 Thread Peter Kirk
On 15/12/2004 00:22, Mike Ayers wrote:
 From: Peter Kirk [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, December 14, 2004 3:37 PM
 Thanks for the clarification. Perhaps the bifurcation could
 be better expressed as into strings of characters as defined
 by the locale and strings of non-null octets. Then I could
 re-express this as the only safe way out of this mess is
 never to process filenames as strings of characters as
 defined by the locale.
That would not be correct for ISO 8859 locales, though 
(amongst others).  That's why I specified UTF-8.  Although other 
locales may have the problem of invalid sequences, we're only 
interested in UTF-8 here.

But surely octets 0x80 to 0x9f are (at least mostly) invalid in ISO 
8859? While some applications may choose to process these invalid characters as if they were valid, displaying them as boxes or not at all (and this is a security risk), others, especially those concerned with security, do in fact treat them as errors, in one way or another. 
For example, Marcin noted for Mozilla:

If a filename ... can be
converted but contains characters like 0x80-0x9F in ISO-8859-2,
they are displayed as question marks and the file is inaccessible.
It should be treated as a general issue with ALL locales and character 
sets (with perhaps just a few exceptions) that not all sequences of 
octets represent valid character strings. UTF-8 is by no means a special 
case here.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 Unix makes it possible for /you/ to change /your/ locale - but by
 your reasoning, this is an error, unless all other users do so
 simultaneously.

Not necessarily: you can change the locale as long as it uses the same
default encoding.

By "error" I mean a bad idea. The system does not prevent you from
changing the locale to a different encoding. But then you are on your
own and various things can break: terminal output will be mangled, you
can't enter characters used in a different encoding from the keyboard,
text files will be illegible, and Unicode programs which process texts
may reject your data or even filenames. If you still need to change
encodings, it's safer to use ASCII-only filenames.

This situation is temporary. Well, it may last 10 more years or so,
but it will probably gradually improve:

First, more protocols and file formats are becoming aware of character
encodings and either label them explicitly or use a known encoding
(generally some Unicode encoding scheme). Especially protocols for
data interchange over Internet: WWW, email, usenet, modern instant
messaging protocols like Jabber. Some old protocols remain
encoding-ignorant, e.g. irc and finger. GNOME 1 used the locale
encoding, GNOME 2 uses UTF-8. Copying and pasting text in X window now
has a separate API which uses UTF-8. While the irc protocol doesn't
specify the encoding, the irssi client can now recode texts itself
to conform to customs of particular channels.

Second, UTF-8 is becoming more usable as the default encoding
specified by the locale. I don't use it now because too many things
still break, but it's improving: there are things which didn't work
just a few years ago and work now. Terminal emulators in X widely
support UTF-8 mode now. The curses library now has a working wide
character API. Emacs and vi work in UTF-8 (Emacs still has problems).
Readline now works in UTF-8. Localized messages (gettext) are now
recoded automatically.

Other programs still don't work. Bash works, while zsh and ksh don't.
Most full-screen text programs use the narrow character curses API and
don't work in UTF-8. Brokenness of interactive interpreters of various
languages varies.

BTW, in the wide character curses API (the only way curses can work
in a UTF-8 terminal), characters are expressed as sequences of wchar_t
(base char + some combining chars, possibly double width). Which means
that you must somehow translate filenames to this representation
in order to display them - same as with a Unicode-based GUI. It's
meaningless to render arbitrary bytes on the terminal, and you can't
force curses to emit the original byte sequences which form filenames
(which would be a bad idea for control characters anyway). By
legitimizing non-UTF-8 filenames in a UTF-8 system you increase the
problems such applications have to overcome: not only do they have to
show control characters somehow, but also invalid UTF-8.

 But it goes beyond that. Copy a file onto a floppy disc and then
 physically take that floppy disc to a different Unix machine and log
 on as guest and insert the disc ... Will the filename look the same?

Depends on the filesystem and the way it is mounted.

For example if it's FAT with long filenames (which I think is the
usual format for floppies even on Unix), filenames can be recoded by
the kernel: you specify the encoding to present filenames in and the
encoding of short names. I don't know what happens with filenames
which are not expressible in the selected encoding.

In this way filenames may automatically convert between systems which
use different default encodings, preserving the character semantics
rather than the byte representation. Of course file contents will not
be converted.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode






Philippe Verdy wrote:
 I have not 
 found a solution to this problem, and I don't know if such 
 solution even 
 exists; if such solution exists, it should be quite complex...).


I think it should be possible to mathematically prove that it doesn't exist.


So, I claim you cannot achieve NOT-UTF-8 => UTF-16 => NOT-UTF-8 and UTF-16 => NOT-UTF-8 => UTF-16 at the same time. But this is not really needed, since none of this affects any UTF trip (and none of the above is one).

And, the funny thing is - currently NOT2-UTF-16 => NOT2-UTF-8 => NOT2-UTF-16 *is* possible (NOT2, because it is not the same conversion; it is actually a UCS-2 conversion). But there is no need for it. NOT-UTF-8 => UTF-16 => NOT-UTF-8 is THE most valuable one. Outside of Unicode, that is. Unicode could acknowledge that fact and yield 128 codepoints.


Lars





RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Arcane Jill wrote:
 The obvious solution is for all Unix machines everywhere to 
 be using the 
 same locale - and it had better be UTF-8. But an instantaneous global 
 switch-over is never going to happen, so we see this gradual 
 switch-over ... 
 and it is during this transition phase that Lars's problem manifests.
Yes, some may not experience it, some will experience it for a day, some for a month, some for a year, some indefinitely.

And unless filesystems prevent invalid sequences from being added, it will keep happening to everybody. And if only very seldom, then it will be even harder to find a person who can fix it.

 Of course, you are not /really/ suggesting that the Unix kernel be rewritten. But it's hard for me to see how else this could be achieved.


What one might pursue is to make the UNIX filesystem invariant, so Windows-like. In that scenario, a filesystem stores Unicode strings and adjusts the representation of filenames according to the user's locale. But there are two reasons against it:

A - If only the filesystem does it, then whenever you switch the locale, all references to files in other files break. Unless you treat the files in the same manner, which is what Windows does if an application is not Unicode (with a number of associated problems on top). But that is not what is supposed to be done on UNIX.

B - As we move to UTF-8, there will be less and less need to use different locales. So why bother with enabling the system to represent UTF-8 in any other locale if that locale will not even be used anymore? Concerns about the transition period do apply, but then you end up with two transitions, which is even less appealing.


So, the only perceivable option is to start thinking about validation in the filesystem. If and when one chooses to enable it. But keep in mind that it will only reduce the problem. Not all programs will be able to rely on it (like virus scanners, HSM, backup, ...).


Lars





Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 OK, strcpy does not need to interpret UTF-8. But strchr probably should.

No. Its argument is a byte, even though it's passed as type int.
By byte here I mean C char value, which is an octet in virtually
all modern C implementations; the C standard doesn't guarantee this
but POSIX does.

Many C functions are not suitable for processing UTF-8, or are
suitable only as long as we consider all non-ASCII characters opaque
bags of bytes. For example isalpha takes a byte, toupper transforms
a byte to a byte, and strncpy copies up to n bytes even if it's
in the middle of a UTF-8 character.

There are wide character versions like iswalpha and towupper. But then
data must be converted from a sequence of char to a sequence of wchar_t.
Standard and semi-standard functions which do this conversion for UTF-8
reject invalid UTF-8 (they all have a means of reporting errors).
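
For illustration, a small C sketch of that behaviour using the standard mbrtowc(); the locale name is an assumption that varies between systems, and the filename is made up. On an invalid sequence mbrtowc returns (size_t)-1 and sets errno to EILSEQ, which is the rejection being described.

#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Sketch: converting a byte string to wide characters with mbrtowc() in a
 * UTF-8 locale.  The lone 0xC2 byte is not a complete UTF-8 sequence, so
 * the conversion stops with an error; there is no standard way to carry
 * such a byte through into the wchar_t string. */
int main(void)
{
    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)  /* locale name varies */
        return 1;
    const char *name = "b\xC2" "d.txt";              /* invalid UTF-8 */
    const char *p = name;
    mbstate_t st;
    memset(&st, 0, sizeof st);
    wchar_t wc;
    size_t r;
    while ((r = mbrtowc(&wc, p, strlen(p), &st)) != 0) {
        if (r == (size_t)-1 || r == (size_t)-2) {
            perror("mbrtowc");                       /* reports EILSEQ here */
            return 1;
        }
        printf("converted U+%04lX\n", (unsigned long)wc);
        p += r;
    }
    return 0;
}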

The assumption that wchar_t has something to do with Unicode is not as
common as about char and bytes. I don't know whether FreeBSD finally
changed their wchar_t to Unicode. And it can be UTF-32 (Unix) or
UTF-16 (Windows).

 But then all languages are supposed to provide functions for
 processing opaque strings in addition to their Unicode functions.

Yes, IMHO all general-purpose languages should support processing
arrays of bytes, in addition to Unicode strings.

It's not clear however how the API of filenames should look like,
especially if they wish to be portable to Windows.

 But sooner or later you need to incorporate the filename in some
 UTF-8 text. An error report, for example.

While it's not clear what a well-behaved application should do by
default, in order to be 100% robust and preserve all information
you must change the usual conventions anyway. Remember that any byte
except \0 and / is valid in a filename, so you must either escape
some characters, or delimit the filename with \0, or prefix it with
the length, or something like this. A backup software should do this
and not pay attention to the locale. But for end-user software like
an image viewer, processing arbitrary filenames is less important.
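
A minimal C sketch of the '\0'-delimited convention mentioned above; the helper and the file names are illustrative only (the same convention is what find -print0 and xargs -0 use).

#include <stdio.h>
#include <string.h>

/* Sketch: emitting filenames delimited by '\0' rather than '\n', so that a
 * consumer can recover any byte sequence a filename may contain. */
static void emit_filename(FILE *out, const char *name)
{
    fwrite(name, 1, strlen(name) + 1, out);   /* include the terminating '\0' */
}

int main(void)
{
    emit_filename(stdout, "ordinary.txt");
    emit_filename(stdout, "b\xC2" "d.txt");   /* not valid UTF-8, still safe */
    return 0;
}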

 What are stdin, stdout and argv (command line parameters) when a
 process is running in a UTF-8 locale?

Technically they are binary (command line arguments must not contain
zero bytes). Users are expecting stdin and stdout to be treated as
text or binary depending on the program, while command line arguments
are generally interpreted as text or filenames.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


RE: Roundtripping in Unicode

2004-12-15 Thread Mike Ayers
Title: RE: Roundtripping in Unicode






 From: Peter Kirk [mailto:[EMAIL PROTECTED]] 
 Sent: Wednesday, December 15, 2004 3:52 AM


 But surely octets 0x80 to 0x9f are (at least mostly) invalid 
 in ISO 8859?


 They are in fact valid. However, because they are control characters, they are not considered displayable.


 While some applications may choose to process 
 these invalid characters as if they were valid, but display 
 them as boxes or not at all (and this is a security risk), 
 others and especially those concerned with security do in 
 fact treat them as errors, in one way or another. 
 For example, Marcin noted for Mozilla:
 
 If a filename ... can be
 converted but contains characters like 0x80-0x9F in ISO-8859-2, they 
 are displayed as question marks and the file is inaccessible.


 This is a good policy and is what Lars should consider. It places the responsibility for the filename where it belongs: on the file's creator.

 It should be treated as a general issue with ALL locales and 
 character sets (with perhaps just a few exceptions) that not 
 all sequences of octets represent valid character strings. 
 UTF-8 is by no means a special case here.


 Exactly. Which underscores just how silly these threads are.



/|/|ike




"Tumbleweed E-mail Firewall tumbleweed.com" made the following
 annotations on 12/15/04 09:50:11
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


RE: Roundtripping in Unicode

2004-12-14 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Arcane Jill wrote:
 I've been following this thread for a while, and I've pretty 
Thanks for bearing with me. And I hope my response will not discourage you from continuing to do so. That is, until I am banned from the list for heresy.

 much got the 
 hang of the issues here. To summarize:
 
 Unix filenames consist of an arbitrary sequence of octets, 
 excluding 0x00 
 and 0x2F. How they are /displayed/ to any given user depends 
 on that user's 
 locale setting. In this scenario, two users with different 
 locale settings 
 will see different filenames for the same file, but they will 
 still be able 
 to access the file via the filename that they see. These two 
 filenames will 
 be spelt identically in terms of octets, but (apparently) 
 differently when 
 viewed in terms of characters.
 
 At least, that's how it was until the UTF-8 locale came along. If we 
I think such problems were already present with Shift-JIS. But I already stated once why this was not noticed, and will not repeat myself unless explicitly asked to do so.

 consider only one-byte-per-character encodings, then any 
 octet sequence is 
 valid in any locale. But UTF-8 introduces the possibility 
 that an octet 
 sequence might be invalid - a new concept for Unix. So if 
 you change your 
 locale to UTF-8, then suddenly, some files created by other 
 users might 
 appear to you to have invalid filenames (though they would 
 still appear 
 valid when viewed by the file's creator).
 
 A specific example: if a file F is accessed by two different 
 users, A and B, 
 of whom A has set their locale to Latin-1, and B has set 
 their locale to 
 UTF-8, then the filename may appear to be valid to user A, 
 but invalid to 
 user B.
 
 Lars is saying (and he's probably right, because he knows 
 more about Unix 
 than I) that user B does not necessarily have the right to 
 change the actual 
 octet sequence which is the filename of F, just to make it 
 appear valid to 
 user B, because doing so would stop a lot of things working 
 for user A (for 
 instance, A might have created the file, the filename might 
 be hardcoded in 
 a script, etc.). So Lars takes a Unix-like approach, saying 
 retain the 
 actual octet sequence, but feel free to try to display and 
 manipulate it as 
 if it were some UTF-8-like encoding in which all octet 
 sequences are valid. 
 And all this seems to work fine for him, until he tries to 
 roundtrip to 
 UTF-16 and back.
 
 I'm not sure why anyone's arguing about this though - 
 Phillipe's suggestion 
 seems to be the perfect solution which keeps everyone happy. So...
Well, it doesn't. The rest of my comments will show you why.


 
 ...allow me to construct a specific example of what Phillipe 
 suggested only 
 generally:
 
 DEFINITION - NOT-Unicode is the character repertoire 
 consisting of the 
 whole of Unicode, and 128 additional characters representing 
 integers in the 
 range 0x80 to 0xFF.
As long as we agree that the codepoints used to store the NOT-Unicode data are valid Unicode codepoints. You noticed yourself that NOT-Unicode should roundtrip through UTF-16. Only valid Unicode codepoints can be safely passed through UTF-16.

 
 OBSERVATION - Unicode is a subset of NOT-Unicode
But unfortunately data can pass from NOT-Unicode to Unicode. Some people think that this is terribly bad. One would think that storing NOT-UTF-8 in NOT-UTF-16 would prevent data from crossing the boundary, but that is not so.

 
 DEFINITION - NOT-UTF-8 is a bidirectional encoding between 
 a NOT-Unicode 
 character stream and an octet stream, defined as follows: if 
 a NOT-Unicode 
 character is a Unicode character then its encoding is the 
 UTF-8 encoding of 
 that character; else the NOT-Unicode character must represent 
 an integer, in 
 which case its encoding is itself. To decode, assume the next 
 NOT-Unicode 
 character is a Unicode character and attempt to decode from 
 the octet stream 
 using UTF-8; if this fails then the NOT-Unicode character is 
 an integer, in 
 which case read one single octet from the stream and return it.
More or less. You have not defined how to return the octet. It must be returned as a valid Unicode codepoint. And if a Unicode character is decoded, one must check whether it is any of the codepoints used for this purpose and escape it. But only when decoding NOT-UTF-8. Decoding from UTF-8 remains unchanged.

 
 OBSERVATION - All possible octet sequences are valid NOT-UTF-8.
Yes, that's the sanity check, because this is what we wanted to get.


 
 OBSERVATION - NOT-Unicode characters which are Unicode 
 characters will be 
 encoded identically in UTF-8 and NOT-UTF-8
Unfortunately not so. Because you started with the wrong assumption that NOT-UTF-8 data will not be stored in valid codepoints. But the fact that this observation is not true is not really a problem.

 
 OBSERVATION - NOT-Unicode characters which are not Unicode 
 characters cannot 
 be represented

Re: Roundtripping in Unicode

2004-12-14 Thread John Cowan
Peter Kirk scripsit:

 I think the problem here is that a Unix filename is a string of octets, 
 not of characters. And so it should not be converted into another 
 encoding form as if it is characters; it should be processed at a quite 
 different level of interpretation.

Unfortunately, that is simply a counsel of perfection.

Unix filenames are in general input as character strings, output as character
strings, and intended to be perceived as character strings.  The corner
cases in which this does not work are not sufficient to overthrow the
power and generality to be achieved by assuming it 99% of the time.

(A private correspondent has come up with an ingenious trick which
depends on being able to create files named 0x08 and 0x7F, but it
truly is a trick, and in any case depends only on an ASCII interpretation.)

-- 
Income tax, if I may be pardoned for saying so, John Cowan
is a tax on income.  --Lord Macnaghten (1901)   [EMAIL PROTECTED]



Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
 NOT-UTF-16 -> NOT-UTF-8

But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
awkward way which would happen to exclude those subsequences of
non-characters which would form a valid UTF-8 fragment.

Unicode has the following property. Consider sequences of valid
Unicode characters: from the range U+0000..U+10FFFF, excluding
non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10, and
U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
in any UTF-n, and nothing else is expected from UTF-n.

With the exception of the set of non-characters being irregular and
IMHO too large (why exclude U+FDD0..U+FDEF?!), and a weird top
limit caused by UTF-16, this gives a precise and unambiguous set of
values for which encoders and decoders are supposed to work. Well,
except for the non-obvious treatment of a BOM (at which level should
it be stripped? does this include UTF-8?).
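
For concreteness, the set just described can be written as a predicate; here is a small C sketch (the function name is my own, not anything from the standard).

#include <stdint.h>

/* Sketch of the set described above: code points that may appear in a
 * sequence of valid Unicode characters, i.e. within range, not a surrogate,
 * and not one of the noncharacters (U+FDD0..U+FDEF, plus U+nFFFE and
 * U+nFFFF for each plane n = 0..0x10). */
static int is_valid_unicode_character(uint32_t cp)
{
    if (cp > 0x10FFFF)                 return 0;  /* beyond the limit imposed by UTF-16 */
    if (cp >= 0xD800 && cp <= 0xDFFF)  return 0;  /* surrogates */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)  return 0;  /* noncharacter block */
    if ((cp & 0xFFFE) == 0xFFFE)       return 0;  /* U+nFFFE / U+nFFFF in every plane */
    return 1;
}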

A variant of UTF-8 which includes all byte sequences yields a much
less regular set of abstract string values. Especially if we consider
that 11101111 10111111 10111110 binary (0xEF 0xBF 0xBE, the UTF-8 form
of U+FFFE) is not valid UTF-8, as much as 0xFFFE is not valid UTF-16
(it's a reversed BOM; it must be invalid in order for a BOM to fulfill
its role).

Question: should a new programming language which uses Unicode for
string representation allow non-characters in strings? Argument for
allowing them: otherwise they are completely useless, except
U+FFFE for BOM detection. Argument for disallowing them: they make
UTF-n inappropriate for serialization of arbitrary strings, and thus
non-standard extensions of UTF-n must be used for serialization.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 Hm, here lies the catch. According to UTC, you need to keep
 processing the UNIX filenames as BINARY data. And, also according
 to UTC, any UTF-8 function is allowed to reject invalid sequences.
 Basically, you are not supposed to use strcpy to process filenames.

No: strcpy passes raw bytes, it does not interpret them according to
the locale. It's not an UTF-8 function.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-14 Thread Peter Kirk
On 14/12/2004 11:32, Arcane Jill wrote:
I've been following this thread for a while, and I've pretty much got 
the hang of the issues here. To summarize:

I haven't followed everything, but here is my 2 cents worth.
I note that there is a real problem. I have had significant problems in 
Windows with files copied from other language systems. Sometimes for 
example these files are listed fine in Explorer but when I try to copy 
or delete them they are not found, presumably because the filename is 
being corrupted somewhere in the system and doesn't match.

Unix filenames consist of an arbitrary sequence of octets, excluding 
0x00 and 0x2F. How they are /displayed/ to any given user depends on 
that user's locale setting. In this scenario, two users with different 
locale settings will see different filenames for the same file, but 
they will still be able to access the file via the filename that they 
see. These two filenames will be spelt identically in terms of octets, 
but (apparently) differently when viewed in terms of characters.

At least, that's how it was until the UTF-8 locale came along. If we 
consider only one-byte-per-character encodings, then any octet 
sequence is valid in any locale. But UTF-8 introduces the 
possibility that an octet sequence might be invalid - a new concept 
for Unix. So if you change your locale to UTF-8, then suddenly, some 
files created by other users might appear to you to have invalid 
filenames (though they would still appear valid when viewed by the 
file's creator).

This is not in fact a new concept. Some octet sequences which are valid 
filenames are invalid in a Latin-1 locale - for example, those which 
include octets in the range 0x80-0x9F, if Latin-1 means ISO 8859-1. 
Some of these octets are of course defined in Windows CP1252 etc, so a 
Unix Latin-1 system may have some interpretation for some of them; but 
others e.g. 0x81 have no interpretation in any flavour of Latin-1 as far 
as I know. So there is by no means a guarantee that every non-Unicode 
Unix locale has an interpretation of every octet, which implies that 
other octets are invalid.

Now no doubt many Unix filename handling utilities ignore the fact that 
some octets are invalid or uninterpretable in the locale, because they 
handle filenames as octet strings (with 0x00 and 0x2F having special 
interpretations) rather than as locale-dependent character strings. But 
these routines should continue to work in a UTF-8 locale, as they make 
no attempt to interpret any octets other than 0x00 and 0x2F.

A specific example: if a file F is accessed by two different users, A 
and B, of whom A has set their locale to Latin-1, and B has set their 
locale to UTF-8, then the filename may appear to be valid to user A, 
but invalid to user B.

Lars is saying (and he's probably right, because he knows more about 
Unix than I) that user B does not necessarily have the right to change 
the actual octet sequence which is the filename of F, just to make it 
appear valid to user B, because doing so would stop a lot of things 
working for user A (for instance, A might have created the file, the 
filename might be hardcoded in a script, etc.). So Lars takes a 
Unix-like approach, saying retain the actual octet sequence, but feel 
free to try to display and manipulate it as if it were some UTF-8-like 
encoding in which all octet sequences are valid. And all this seems 
to work fine for him, until he tries to roundtrip to UTF-16 and back.

I think the problem here is that a Unix filename is a string of octets, 
not of characters. And so it should not be converted into another 
encoding form as if it is characters; it should be processed at a quite 
different level of interpretation.

Of course a system is free to do what it wants internally.
I'm not sure why anyone's arguing about this though - Phillipe's 
suggestion seems to be the perfect solution which keeps everyone 
happy. So...

...allow me to construct a specific example of what Phillipe suggested 
only generally:

...
This would appear to solve Lars' problem, and because the three 
encodings, NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be 
UTFs, no-one need get upset.

All of this is ingenious, and may be useful for internal processing 
within a Unix system, and perhaps even for interaction between 
cooperating systems. But NOT-Unicode is not Unicode (!) and so Unicode 
should not be expected to standardise it.

I can see that there may be a need for a protocol for open exchange of 
Unix-like filenames. But these filenames should be treated as binary 
data (which may or may not be interpretable in any one locale) and 
encoded as such, rather than forced into the mould of Unicode characters 
which it does not fit.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



RE: Roundtripping in Unicode

2004-12-14 Thread Kenneth Whistler
Lars said:

 According to UTC, you need to keep processing
 the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8
 function is allowed to reject invalid sequences. Basically, you are not
 supposed to use strcpy to process filenames.

This is a very misleading set of statements.

First of all, the UTC has not taken *any* position on the
processing of UNIX filenames. That is an implementation issue
outside the scope of what the UTC normally deals with, and I
doubt that it will take a position on the issue.

It is erroneous to imply that the UTC has indicated that you
are not supposed to use strcpy to process filenames. It has
done nothing of the kind, and I don't know of any reason why
anyone should think otherwise. I certainly use strcpy to process
filenames, UTF-8 or not, and expect that nearly every implementer
on the list has done so, too.

Any process *interpreting* a UTF-8 code unit sequence as
characters can and should recognize invalid sequences, but
that is a different matter.

If I pass the byte stream 0x80 0xFF 0x80 0xFF 0x80 0xFF to
a process claiming conformance to UTF-8 and ask it to interpret
that as Unicode characters, it should tell me that it is
garbage. *How* it tells me that it is garbage is a matter of
API design, code design, and application design.

But there is *nothing* new here.

If I pass the byte stream 0x80 0xFF 0x80 0xFF 0x80 0xFF to
a process claiming conformance to Shift-JIS and ask it to interpret
that as JIS characters, it should tell me that it is
garbage. *How* it tells me that it is garbage is a matter of
API design, code design, and application design.

Unicode did not invent the notion of conformance to character
encoding standards. What is new about Unicode is that it has
*3* interoperable character encoding forms, not just one, and
all of them are unusual in some way, because they are designed
for a very, very large encoded character repertoire, and
involve multibyte and/or non-byte code unit representations.

 Well, I just hope no one will listen to them and modify strcpy and strchr to
 validate the data when running in UTF-8 locale and start signalling
 something (really, where and how?!). The two statements from UTC don't make
 sense when put together. Unless we are really expected to start building
 everything from scratch.

This is bogus. The UTC has never asked anyone to modify strcpy
and strchr. What anyone implementing UTF-8 using a C runtime
library (or similar set of functions) has to do is completely
comparable to what they have to do for supporting any other
multibyte character encoding on such systems. If your system
handles euc-kr, euc-tw, and/or euc-jp correctly, then adding
UTF-8 support is comparable, in principle and in practice.

--Ken




Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 If so, Marcin, what exactly is the error, and whose fault is it?

It's an error to use locales with different encodings on the same
system.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-14 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Lars Kristan [EMAIL PROTECTED] writes:
Hm, here lies the catch. According to UTC, you need to keep
processing the UNIX filenames as BINARY data. And, also according
to UTC, any UTF-8 function is allowed to reject invalid sequences.
Basically, you are not supposed to use strcpy to process filenames.
No: strcpy passes raw bytes, it does not interpret them according to
the locale. It's not an UTF-8 function.
Correct: [wc]strcpy() handles string instances, but not all string 
instances are plain-text, so they don't need to obey UTF encoding rules 
(they just obey the convention of null-byte termination, with no 
restriction on the string length, which is measured as a size in 
[w]char[_t] and not as a number of Unicode characters).

This is true for the whole standard C/C++ string libraries, as well as in 
Java (String and Char objects or native char datatype), and as well in 
almost all string handling libraries of common programming languages.

A locale defined as UTF-8 will experience lots of problems because of 
the various ways applications will behave when faced with encoding errors 
encountered in filenames: exceptions thrown that abort the program, 
substitution by ? or U+FFFD causing the wrong files to be accessed, some 
files not processed because their name was considered invalid although 
they were effectively created by some user of another locale...

Filenames are identifiers coded as strings, not as plain-text (even if most 
of these filename strings are plain-text).

The solution is then to use a locale based on a relaxed version of UTF-8 
(some spoke about defining NOT-UTF-8 and NOT-UTF-16 encodings to allow 
any sequence of code units, but nobody has thought about how to make 
NOT-UTF-8 and NOT-UTF-16 mutually fully reversible; now add NOT-UTF-32 
to this nightmare and you will see that NOT-UTF-32 needs to encode 2^32 
distinct NOT-Unicode-codepoints, and that they must map bijectively to 
exactly all 2^32 sequences possible in NOT-UTF-16 and NOT-UTF-8; I have not 
found a solution to this problem, and I don't know if such solution even 
exists; if such solution exists, it should be quite complex...).




Re: Roundtripping in Unicode

2004-12-14 Thread Doug Ewell
 Unicode did not invent the notion of conformance to character
 encoding standards. What is new about Unicode is that it has
 *3* interoperable character encoding forms, not just one, and
 all of them are unusual in some way, because they are designed
 for a very, very large encoded character repertoire, and
 involve multibyte and/or non-byte code unit representations.

Geez, even when I was going through my stage of inventing wild and crazy
new UTF's, I made sure they were 100% convertible to and from code
points.  How could they not be?

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
Title: RE: Roundtripping in Unicode






 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED]] On Behalf Of Mike Ayers
 Sent: Tuesday, December 14, 2004 3:29 PM


 The rule is No zero, no eight. 


 No zero, no forty seven.


 My bad.




/|/|ike




"Tumbleweed E-mail Firewall tumbleweed.com" made the following
 annotations on 12/14/04 16:25:28
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
Title: RE: Roundtripping in Unicode






 From: Peter Kirk [mailto:[EMAIL PROTECTED]] 
 Sent: Tuesday, December 14, 2004 3:37 PM


 Thanks for the clarification. Perhaps the bifurcation could 
 be better expressed as into "strings of characters as defined 
 by the locale" and "strings of non-null octets". Then I could 
 re-express this as "the only safe way out of this mess is 
 never to process filenames as strings of characters as 
 defined by the locale".


 That would not be correct for ISO 8859 locales, though (amongst others). That's why I specified UTF-8. Although other locales may have the problem of invalid sequences, we're only interested in UTF-8 here.

 Well, I was assuming that when John Cowan implied that 0x08 
 was permitted, and Jill wrote "Unix filenames consist of an 
 arbitrary sequence of octets, excluding 0x00 and 0x2F", they 
 were speaking from the appropriate orifices.


 Correct, and my bad. I got thrown off by John's:


(A private correspondent has come up with an ingenious trick which 
depends on being able to create files named 0x08 and 0x7F, but it truly 
is a trick, and in any case depends only on an ASCII interpretation.)


 which I misinterpreted to mean that 0x08 was a forbidden character. It isn't - just real hard to type!



/|/|ike




"Tumbleweed E-mail Firewall tumbleweed.com" made the following
 annotations on 12/14/04 16:24:51
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


RE: Roundtripping in Unicode

2004-12-14 Thread Arcane Jill
I've been following this thread for a while, and I've pretty much got the 
hang of the issues here. To summarize:

Unix filenames consist of an arbitrary sequence of octets, excluding 0x00 
and 0x2F. How they are /displayed/ to any given user depends on that user's 
locale setting. In this scenario, two users with different locale settings 
will see different filenames for the same file, but they will still be able 
to access the file via the filename that they see. These two filenames will 
be spelt identically in terms of octets, but (apparently) differently when 
viewed in terms of characters.

At least, that's how it was until the UTF-8 locale came along. If we 
consider only one-byte-per-character encodings, then any octet sequence is 
valid in any locale. But UTF-8 introduces the possibility that an octet 
sequence might be invalid - a new concept for Unix. So if you change your 
locale to UTF-8, then suddenly, some files created by other users might 
appear to you to have invalid filenames (though they would still appear 
valid when viewed by the file's creator).

A specific example: if a file F is accessed by two different users, A and B, 
of whom A has set their locale to Latin-1, and B has set their locale to 
UTF-8, then the filename may appear to be valid to user A, but invalid to 
user B.

Lars is saying (and he's probably right, because he knows more about Unix 
than I) that user B does not necessarily have the right to change the actual 
octet sequence which is the filename of F, just to make it appear valid to 
user B, because doing so would stop a lot of things working for user A (for 
instance, A might have created the file, the filename might be hardcoded in 
a script, etc.). So Lars takes a Unix-like approach, saying retain the 
actual octet sequence, but feel free to try to display and manipulate it as 
if it were some UTF-8-like encoding in which all octet sequences are valid. 
And all this seems to work fine for him, until he tries to roundtrip to 
UTF-16 and back.

I'm not sure why anyone's arguing about this though - Philippe's suggestion 
seems to be the perfect solution which keeps everyone happy. So...

...allow me to construct a specific example of what Philippe suggested only 
generally:

DEFINITION - NOT-Unicode is the character repertoire consisting of the 
whole of Unicode, and 128 additional characters representing integers in the 
range 0x80 to 0xFF.

OBSERVATION - Unicode is a subset of NOT-Unicode
DEFINITION - NOT-UTF-8 is a bidirectional encoding between a NOT-Unicode 
character stream and an octet stream, defined as follows: if a NOT-Unicode 
character is a Unicode character then its encoding is the UTF-8 encoding of 
that character; else the NOT-Unicode character must represent an integer, in 
which case its encoding is itself. To decode, assume the next NOT-Unicode 
character is a Unicode character and attempt to decode from the octet stream 
using UTF-8; if this fails then the NOT-Unicode character is an integer, in 
which case read one single octet from the stream and return it.

OBSERVATION - All possible octet sequences are valid NOT-UTF-8.
OBSERVATION - NOT-Unicode characters which are Unicode characters will be 
encoded identically in UTF-8 and NOT-UTF-8

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot 
be represented in UTF-8

DEFINITION - NOT-UTF-16 is a bidirectional encoding between a NOT-Unicode 
character stream and a 16-bit word stream, defined as follows: if a 
NOT-Unicode character is a Unicode character then its encoding is the UTF-16 
encoding of that character; else the NOT-Unicode character must represent an 
integer, in which case its encoding is 0xDC00 plus the integer. To decode, 
if the next 16-bit word is in the range 0xDC80 to 0xDCFF then the 
NOT-Unicode character is the integer whose value is (word16 - 0xDC00), else 
the NOT-Unicode character is the Unicode character obtained by decoding as 
if UTF-16.

OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 -> 
NOT-UTF-16 -> NOT-UTF-8

OBSERVATION - NOT-Unicode characters which are Unicode characters will be 
encoded identically in UTF-16 and NOT-UTF-16

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot 
be represented in UTF-16

DEFINITION - NOT-UTF-32 is a bidirectional encoding between a NOT-Unicode 
character stream and a 32-bit word stream, defined as follows: if a 
NOT-Unicode character is a Unicode character then its encoding is the UTF-32 
encoding of that character; else the NOT-Unicode character must represent an 
integer, in which case its encoding is 0xDC00 plus the integer. To 
decode, if the next 32-bit word is in the range 0xDC80 to 0xDCFF 
then the NOT-Unicode character is the integer whose value is (word32 - 
0xDC00), else the NOT-Unicode character is the Unicode character 
obtained by decoding as if UTF-32.

OBSERVATION - Roundtripping is possible in the directions NOT-UTF-8 -> 
NOT-UTF-32 -> NOT-UTF-8
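For concreteness, here is one possible C rendering of the NOT-UTF-8 and NOT-UTF-16 definitions above (a sketch only, not part of the original mail; buffer management is ignored). Octets that do not begin a well-formed UTF-8 sequence come out as the 16-bit units 0xDC80..0xDCFF and turn back into the identical octets on the way out, which is the NOT-UTF-8 -> NOT-UTF-16 -> NOT-UTF-8 roundtrip observed above.

#include <stdint.h>
#include <stddef.h>

/* Decode one NOT-UTF-8 unit starting at p (end marks the end of the input).
 * Stores either a Unicode scalar value (for a valid UTF-8 sequence) or
 * 0xDC00 + b for a byte b that does not start one, and returns the number
 * of bytes consumed (always at least 1). */
static size_t not_utf8_get(const unsigned char *p, const unsigned char *end,
                           uint32_t *out)
{
    unsigned char b = p[0];
    uint32_t cp;
    size_t len, i;

    if (b < 0x80) { *out = b; return 1; }
    else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; len = 2; }
    else if (b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; len = 3; }
    else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; len = 4; }
    else goto raw;              /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

    if ((size_t)(end - p) < len) goto raw;
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) goto raw;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000) ||
        (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        goto raw;               /* overlong forms, surrogates, beyond U+10FFFF */
    *out = cp;
    return len;
raw:
    *out = 0xDC00u + b;         /* the "integer" NOT-Unicode character for one raw byte */
    return 1;
}

/* NOT-UTF-8 octets -> NOT-UTF-16 code units; returns the number of units
 * written (dst needs room for at most n units). */
size_t not_utf8_to_not_utf16(const unsigned char *src, size_t n, uint16_t *dst)
{
    size_t i = 0, w = 0;
    uint32_t v;
    while (i < n) {
        i += not_utf8_get(src + i, src + n, &v);
        if (v < 0x10000) {
            dst[w++] = (uint16_t)v;            /* BMP character, or 0xDC00 + raw byte */
        } else {                               /* supplementary plane: surrogate pair */
            v -= 0x10000;
            dst[w++] = (uint16_t)(0xD800 + (v >> 10));
            dst[w++] = (uint16_t)(0xDC00 + (v & 0x3FF));
        }
    }
    return w;
}

/* NOT-UTF-16 code units -> NOT-UTF-8 octets; returns the number of bytes
 * written (dst needs room for at most 3 * n bytes). */
size_t not_utf16_to_not_utf8(const uint16_t *src, size_t n, unsigned char *dst)
{
    size_t i = 0, b = 0;
    while (i < n) {
        uint32_t v = src[i++];
        if (v >= 0xDC80 && v <= 0xDCFF) {      /* an escaped raw byte: emit it unchanged */
            dst[b++] = (unsigned char)(v - 0xDC00);
            continue;
        }
        if (v >= 0xD800 && v <= 0xDBFF && i < n &&
            src[i] >= 0xDC00 && src[i] <= 0xDFFF)  /* surrogate pair -> scalar value */
            v = 0x10000 + ((v - 0xD800) << 10) + (src[i++] - 0xDC00);
        if (v < 0x80) {                        /* ordinary UTF-8 encoding from here on */
            dst[b++] = (unsigned char)v;
        } else if (v < 0x800) {
            dst[b++] = (unsigned char)(0xC0 | (v >> 6));
            dst[b++] = (unsigned char)(0x80 | (v & 0x3F));
        } else if (v < 0x10000) {
            dst[b++] = (unsigned char)(0xE0 | (v >> 12));
            dst[b++] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
            dst[b++] = (unsigned char)(0x80 | (v & 0x3F));
        } else {
            dst[b++] = (unsigned char)(0xF0 | (v >> 18));
            dst[b++] = (unsigned char)(0x80 | ((v >> 12) & 0x3F));
            dst[b++] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
            dst[b++] = (unsigned char)(0x80 | (v & 0x3F));
        }
    }
    return b;
}

The roundtrip only works because a well-formed UTF-8 sequence can never decode to a surrogate value, so the 0xDC80..0xDCFF units are never confused with the encoding of a real character.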

UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)

2004-12-14 Thread Edward H. Trager
On Tuesday 2004.12.14 12:50:43 -, Arcane Jill wrote:
 If I have understood this correctly, filenames are not in a locale, they 
 are absolute. Users, on the other hand, are in a locale, and users view 
 filenames. The same filename can look different to two different users. 
 To user A (whose locale is Latin-1), a filename might look valid; to user B 
 (whose locale is UTF-8), the same filename might look invalid.

Correct. The problem will however be limited to the accented
Latin characters present in ISO-8859-1 beyond the ASCII set.  The basic Latin
alphabet in the ASCII set
at the beginning of both ISO-8859-1 and UTF-8 will appear unchanged to both 
users (UTF-8 user looking at Latin-1's home directory, or Latin-1 looking at
UTF-8's home directory).  So both users could probably guess the filename
they were looking at.  For example, here is a file on my local machine,
a Linux box with the locale set to LANG=en_US.UTF-8:

  déclaration_des_droits.utf8

The accented e in déclaration appears correctly under the UTF-8 locale.

I then copied this file (using scp) over to an older Sun Solaris box which I do 
not administer, so I have to live with the C POSIX locale that they have got 
that machine set to.  Now, when I view the file names in a terminal (where the 
terminal emulator is set to the same locale), I see:

  d??claration_des_droits.utf8

The terminal, being set to interpret the legacy locale, does not know 
how to interpret the two bytes that are used for the UTF-8 é.
Still, I can guess that the first word should be déclaration.

The solution, as has been pointed out, is for everyone to move to
UTF-8 locales.  In the Linux and Unix world, this is already happening
for the most part.  Solaris 10 now defaults to a UTF-8 locale, at least
when set to English.  Both SuSE and Redhat default to UTF-8 locales
for most language and script environments.  And (open source) tools exist for
converting file names from one encoding to another encoding on Linux
and Unix systems.  A group of Japanese developers is working on an NLS 
implementation for the BSDs like OpenBSD, which are currently stuck with 
nothing but the C POSIX locale.  I think the name of that project is Citrus.

-- Ed Trager

   

 
 Is that right, Lars?
 
 If so, Marcin, what exactly is the error, and whose fault is it?
 
 Jill
 
 -Original Message-
 
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 
 Behalf Of Marcin 'Qrczak' Kowalczyk
 
 Sent: 13 December 2004 14:59
 
 To: [EMAIL PROTECTED]
 
 Subject: Re: Roundtripping in Unicode
 
 Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
 
 
 
 
 
 


Re: Roundtripping in Unicode

2004-12-14 Thread Peter Kirk
On 14/12/2004 17:47, John Cowan wrote:
Peter Kirk scripsit:
 

I think the problem here is that a Unix filename is a string of octets, 
not of characters. And so it should not be converted into another 
encoding form as if it is characters; it should be processed at a quite 
different level of interpretation.
   

Unfortunately, that is simply a counsel of perfection.
Unix filenames are in general input as character strings, output as character
strings, and intended to be perceived as character strings.  The corner
cases in which this does not work are not sufficient to overthrow the
power and generality to be achieved by assuming it 99% of the time.
 

This is a design flaw in Unix, or in how it is explained to users. Well, 
Lars wrote "Basically, you are not supposed to use strcpy to process 
filenames." I'm not sure if that is his opinion or someone else's, but 
the only safe way out of this mess is never to process filenames as strings.

(A private correspondent has come up with an ingenious trick which
depends on being able to create files named 0x08 and 0x7F, but it
truly is a trick, and in any case depends only on an ASCII interpretation.)
 

This may be called a trick but it looks like it could very easily be a 
security hole. For example, a filename 0x41 0x08 0x42 will be displayed 
the same as just 0x42, in a Latin-1 or UTF-8 locale. Your friend's trick 
has become an open door for spoofers.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
Title: RE: Roundtripping in Unicode






 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED]] On Behalf Of Peter Kirk
 Sent: Tuesday, December 14, 2004 11:32 AM


 This is a design flaw in Unix, or in how it is explained to 
 users. Well, Lars wrote "Basically, you are not supposed to 
 use strcpy to process filenames." I'm not sure if that is his 
 opinion or someone else's, but the only safe way out of this 
 mess is never to process filenames as strings.


 As mentioned by Kenneth, Lars was speaking from the wrong orifice when he said that.


 Also, it appears that the term string is being used too much and without qualification. The entire focus of this thread is on what happens when unqualified bytes (filenames) get qualified (by locale), so it would behoove us all to qualify all the strings we're talking about. For instance, Peter's last clause above bifurcates into:

 ...but the only safe way out of this mess is never to process filenames as UTF-8 strings.


 and:


 ...but the only safe way out of this mess is always to process filenames as opaque C strings.


 which was mentioned early on in this thread, but Lars does not wish to do this.


 This may be called a trick but it looks like it could very 
 easily be a security hole. For example, a filename 0x41 0x08 
 0x42 will be displayed the same as just 0x42, in a Latin-1 or 
 UTF-8 locale. Your friend's trick has become an open door for 
 spoofers.


 Exactly why 0x08 was banned in filenames, as I recall.



/|/|ike




"Tumbleweed E-mail Firewall tumbleweed.com" made the following
 annotations on 12/14/04 13:16:29
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


Re: Roundtripping in Unicode

2004-12-14 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Arcane Jill [EMAIL PROTECTED] writes:
If so, Marcin, what exactly is the error, and whose fault is it?
It's an error to use locales with different encodings on the same
system.
More simply, I think that it's an error to have the encoding part of any 
locale... The system should not depend on them, and for critical things like 
filesystem volumes, the encoding should be forced by the filesystem itself, 
and applications should mandatorily follow the filesystem rules.

Now think about the web itself: it's really a filesystem, with billions of 
users, or trillions of applications using simultaneously hundreds or thousands 
of incompatible encodings... Many resources on the web seem to have valid 
URLs for some users but not for others, until URLs are made independent of 
any user locale, and then not considered as encoded plain-text but only as 
strings of bytes.




RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
Title: RE: Roundtripping in Unicode






 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED]] On Behalf Of Philippe Verdy
 Sent: Tuesday, December 14, 2004 2:47 PM


 More simply, I think that it's an error to have the encoding 
 part of any locale... The system should not depend on them, 
 and for critical things like filesystem volumes, the encoding 
 should be forced by the filesystem itself, and applications 
 should mandatorily follow the filesystem rules.


 It doesn't, it is, and they do.


 The rule is No zero, no eight.


 The problem is that these valid filenames can't all be translated as valid UTF-8 Unicode.


 Now think about the web itself: it's really a filesystem, 


 No. It isn't.


 with billions of users, or trillions of applications using 
 simultaneously hundreds or thousands of incompatible 
 encodings... Many resources on the web seem to have valid 
 URLs for some users but not for others, until URLs are made 
 independent of any user locale, and then not considered as 
 encoded plain-text but only as strings of bytes.


 I thought that URLs were specified to be in Unicode. Am I mistaken?



/|/|ike



P.S. [OT] Note the below autoattachment. I recall that we discussed such clauses on the list some time ago with regard to their legal standing. Does anyone have a pointer to substantive material on the subject? I've gotten curious again, 'natch.



"Tumbleweed E-mail Firewall tumbleweed.com" made the following
 annotations on 12/14/04 15:31:51
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


Re: Roundtripping in Unicode

2004-12-14 Thread John Cowan
Mike Ayers scripsit:

   I thought that URLs were specified to be in Unicode.  Am I mistaken?

You are.  URLs are specified to be in *ASCII*.  There is a %-encoding
hack that allows you to represent random-octet filenames as ASCII.
Some people (including me) think it's a good idea to use this hack
to specify non-ASCII characters with double encoding (first as UTF-8,
then with the %-hack), but the URI Syntax RFC doesn't say.
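A tiny sketch of that double encoding (illustrative only; it escapes every octet outside a conservative unreserved ASCII set, so UTF-8 bytes and arbitrary filename bytes alike come out as plain ASCII):

#include <stdio.h>

/* Percent-encode arbitrary octets (for example a UTF-8 or raw filename) as
 * pure ASCII.  Everything outside a conservative unreserved set becomes %XX. */
static void pct_encode(const unsigned char *src, size_t n, FILE *out)
{
    static const char hex[] = "0123456789ABCDEF";
    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];
        int unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                         (c >= '0' && c <= '9') ||
                         c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved)
            fputc(c, out);
        else
            fprintf(out, "%%%c%c", hex[c >> 4], hex[c & 0x0F]);
    }
}

int main(void)
{
    /* The UTF-8 bytes of "déclaration" followed by a stray ill-formed 0xA9. */
    const unsigned char name[] = "d\xC3\xA9" "claration_" "\xA9.txt";
    pct_encode(name, sizeof name - 1, stdout);   /* d%C3%A9claration_%A9.txt */
    putchar('\n');
    return 0;
}

Whether the receiver then treats the unescaped octets as UTF-8, Latin-1 or raw bytes is exactly the agreement the RFC leaves open.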

-- 
John Cowan  [EMAIL PROTECTED]
http://www.reutershealth.comhttp://www.ccil.org/~cowan
Humpty Dump Dublin squeaks through his norse
Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
Humpty Dump Dublin's grandada of all rogues.  --Cousin James


RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 You are trying to stick with processing byte sequences, carefully
 preserving the storage format instead of preserving the meaning in
 terms of Unicode characters. This leads to less robust software
 which is not certain about the encoding of texts it processes and
 thus can't apply algorithms like case mapping without risking doing
 a meaningless damage to the text.
I am not proposing that this approach is better or that it should be used generally. What I am saying is that this approach is, unfortunately, needed in order to make the transition easier. The fact is that currently data exists that cannot be converted easily. Over-robust software, in my opinion, can be impractical and might not be accepted with open arms. We should acknowledge the fact that some products will choose a different path. You can say these applications will be less robust, but we should really give the users a choice and let them decide what they want.

 Conversion should signal an error by default. Replacing errors by
 U+FFFD should be done only when the data is processed purely for
 showing it to the user, without any further processing, i.e. when it's
 better to show the text partially even if we know that it's corrupted.
I think showing it to the user is not the only case when you need to use U+FFFD. A text viewer could do the replacement when reading the file and do further processing in Unicode. But an editor cannot. Keeping the text in original binary form is far from practical and opens numerous possibilities for bugs. But, as I once already said, you can do it with UTF-8, you simply keep the invalid sequences as they are, and really handle them differently only when you actually process them or display them. But you cannot do this in UTF-16, since you cannot preserve all the data.

As for signalling - in some cases signalling is impossible. Listing files in a directory should not signal anything. It MUST return all files and it should also return them in a way that this list can be used to access each of the files.
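A small sketch of the kind of listing meant here (illustrative only, plain POSIX calls, error handling mostly omitted): the names that readdir() hands back are kept as opaque bytes and passed straight back to openat(), so every file is reachable and nothing has to be signalled, whatever the current locale thinks of the byte sequences.

#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

/* Walk a directory the way a backup tool or virus scanner has to:
 * the names coming back from readdir() are opaque bytes, and those same
 * bytes are handed straight back to openat(), so every file is reachable
 * no matter what the current locale thinks of the byte sequence. */
int scan_directory(const char *path)
{
    DIR *d = opendir(path);
    if (d == NULL)
        return -1;

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.' &&
            (e->d_name[1] == '\0' ||
             (e->d_name[1] == '.' && e->d_name[2] == '\0')))
            continue;                          /* skip "." and ".." */

        int fd = openat(dirfd(d), e->d_name, O_RDONLY);
        if (fd >= 0) {
            /* ... read, back up or scan the file here ... */
            close(fd);
        }
        /* Only for display would the name be pushed through a locale
         * (or roundtrip-preserving) conversion; access never depends on it. */
    }
    closedir(d);
    return 0;
}

Display is the only step where a conversion to characters has to enter the picture at all.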

 
  Either you do everything in UTF-8, or everything in UTF-16. Not
  always, but typically. If comparisons are not always done in the
  same UTF, then you need to validate. And not validate while
  converting, but validate on its own. And now many designers will
  remember that they didn't. So, all UTF-8 programs (of that kind)
  will need to be fixed. Well, might as well adopt my broken
  conversion and fix all UTF-16 programs. Again, of that kind, not all
  in general, so there are few. And even those would not be all
  affected. It would depend on which conversion is used where. Things
  could be worked out. Even if we would start changing all the
  conversions. Even more so if a new conversion is added and only used
  when specifically requested.
 
 I don't understand anything of this.
Let's start with UTF-8 usernames. This is a likely scenario, since I think UTF-8 will typically be used in network communication. If you store the usernames in UTF-16, the conversion will signal an error and you will not have any users with invalid UTF-8 sequences nor will any invalid sequence be able to match any user. If you later on start comparing users somewhere else, in UTF-8, then you must not only strcmp them, but also validate each string. This is just a fact and I am not complaining about it.

In the opposite case, if you would have UTF-8 storage and UTF-16 communication, and any comparisons would be done in UTF-16, you again need to validate the UTF-16 strings.

Now I am supposing that there are such applications already out there. And that some of them do not validate (or validate only in conversion, but not when comparing or otherwise processing native strings).

They should be analyzed and fixed. At the time I wrote the above paragraph, I thought UTF-16 programs don't need to validate, but that is not true, so all the applications need to be fixed, if they are not already validating.

Now, suppose my 'broken' conversion is standardized. As an option, not for UTF-16 to UTF-8 conversion. If you don't start using it, the existing rules apply.

The interesting thing is that if you do start using my conversion, you can actually get rid of the need to validate UTF-8 strings in the first scenario. That of course means you will allow users with invalid UTF-8 sequences, but if one determines that this is acceptable (or even desired), then it makes things easier. But the choice is yours.

For the second scenario, things do indeed become a bit more complicated. But can be solved. And there is still a number of choices you can make about the level of validation. And, again, one of them is that you keep using the existing conversion and the existing validation.

 
  I cannot afford not to access the files.
 
 Then you have two choices:
 - Don't use Unicode.
As soon as a Windows system enters the picture, it is practically impossible

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 But, as I once already said, you can do it with UTF-8, you simply
 keep the invalid sequences as they are, and really handle them
 differently only when you actually process them or display them.

UTF-8 is painful to process in the first place. You are making it
even harder by demanding that all functions which process UTF-8 do
something sensible for bytes which don't form valid UTF-8. They even
can't temporarily convert it to UTF-32 for internal processing for
convenience.

 Listing files in a directory should not signal anything. It MUST
 return all files and it should also return them in a way that this
 list can be used to access each of the files.

Which implies that they can't be interpreted as UTF-8.

By masking an error you are not encouraging users to fix it.
Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.

 Let's start with UTF-8 usernames. This is a likely scenario, since I
 think UTF-8 will typically be used in network communication. If you
 store the usernames in UTF-16, the conversion will signal an error
 and you will not have any users with invalid UTF-8 sequences nor
 will any invalid sequence be able to match any user. If you later on
 start comparing users somewhere else, in UTF-8, then you must not
 only strcmp them, but also validate each string. This is just a fact
 and I am not complaining about it.

If usernames are supposed to be UTF-8, and in fact they are not,
then it's normal that some software will signal an error instead
of processing them. The proper way is to fix the username database,
not to change programs.

 The interesting thing is that if you do start using my conversion,
 you can actually get rid of the need to validate UTF-8 strings
 in the first scenario. That of course means you will allow users
 with invalid UTF-8 sequences, but if one determines that this is
 acceptable (or even desired), then it makes things easier. But the
 choice is yours.

For me it's not acceptable, so I will not support declaring it valid.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Philippe Verdy wrote:
 An implementation that uses UTF-8 for valid string could use 
 the invalid 
 ranges for lead bytes to encapsulate invalid byte values. 
 Note however that 
 invalid bytes you would need to represent have 256 possible 
 values, but the 
 UTF-8 lead bytes have only 2 reserved values (0xC0 and 0xC1) 
 each for 64 
 codes, if you want to use an encoding on two bytes. The 
 alternative would be 
 to use the UTF-8 lead byte values which have initially been 
 assigned to byte 
 sequences longer than 4 bytes, and that are now unassigned/invalid in 
 standard UTF-8. For example: {0xF8+(n/64); 0x80+(n%64)}.
 Here also it will be a private encoding, that should NOT be 
 named UTF-8, and 
 the application should clearly document that it will not only 
 accept any 
 valid Unicode string, but also some invalid data which will have some 
 roundtrip compatibility.
Now you are devising an algorithm to store invalid sequences with other invalid sequences. In UTF-8. Why not simply stick with the original invalid sequences?

And the whole purpose of what I am trying to do is to get VALID sequences. In order to be able to store and manipulate with Unicode strings.

 
 So what is the problem: suppose that the application, 
 internally, starts to 
 generate strings containing any occurences of such private 
 sequences, then 
 it will be possible for the application to generate on its 
 output a byte 
 stream that would NOT have roundtrip compatibility, back to 
 the private 
 representation. So roundtripping would only be guaranteed for streams 
 converted FROM an UTF-8 where some invalid sequences are 
 present and must be 
 preserved by the internal representation. So the 
 transformation is not 
 bijective as you would think, and this potentially creates 
 lots of possible 
 security issues.
Yes, it does. An application that uses my approach needs to be designed accordingly. *IF* the security issues apply. For a UTF-16 text editor this probably doesn't apply (in terms of data, not filenames). And this is just an example; with a text editor you can perhaps force the user to select a different encoding, but there are cases where this cannot be done and the data still needs to be preserved.

So far, many people have suggested that there is no need to preserve 'invalid data'. After some argumentation and a couple of examples, the need is acknowledged. But then they question the way it is done. They see the codepoint approach as unsuitable or unneeded. And suggest using some form of escaping. Now, any escaping has exactly the same problems you are mentioning, and some on top. And is actually representing invalid data with valid codepoints (except more than one per invalid byte), which you say is a definite no-no.

And on top of all, the approach I am proposing is NOT intended to be used everywhere. It should only be used when interfacing to a system that cannot guarantee valid UTF-8, but does use UTF-8. For example, a UNIX filesystem. And, actually, if the security is entirely done by the filesystem, then it doesn't even matter if two UTF-16 strings map to the same filename. They will open the same file. Or be both denied. Which is exactly what is required. A Windows filesystem is case preserving but case insensitive. Did it ever bother you that you can use either upper case or lower case filename to open a file? Does it introduce security issues? Typically no, because you leave the security to the filesystem. And those checks are always done in the same UTF.

This is a simple example of something that doesn't even need to be fixed. There are cases where validation would really need to be fixed. But then again, only if you use the new conversion. If you don't, your security remains exactly where it is today.

We should be analyzing the security aspects. Learning where it can break, and in which cases. Get to know the enemy. And once we understand that things are manageable and not as frightening as it seems at first, then we can stop using this as an argument against introducing 128 codepoints. People who will find them useful should and will bother with the consequences. Others don't need to and can roundtrip them as today.

So, interpreting the 128 codepoints as 'recreate the original byte sequence' is an option. If you convert from UTF-16 to UTF-8, then you do exactly as you do now. Even I will do the same where I just want to represent Unicode in UTF-8. I will only use this conversion in certain places. The fact that my conversion actually produces UTF-8 from most of Unicode points does not mean it produced UTF-8. The result is just a byte sequence. The same one that I started with when I was replacing invalid sequences with the 128 codepoints. And this is not limited to conversion from 'byte sequence that is mostly UTF-8' to UTF-16. I can (and even should) convert from this byte sequence to UTF-8. Preserving most of it and replacing each byte of invalid sequences

RE: RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: RE: Roundtripping in Unicode






Philippe VERDY wrote:
 If a source sequence is invalid, and you want to preserve it, 
 then this sequence must remain invalid if you change its encoding.
 So there's no need for Unicode to assign valid code points 
 for invalid source data.
Using invalid UTF-16 sequences to represent invalid UTF-8 sequences is a known approach (UTF-8B, if I remember correctly). But this is then not UTF-16 data so you don't gain much. The data is at risk of being rejected or filtered out at any time. And that misses the whole point.

Specifically, unpaired surrogates that are used in the UTF-8B conversion have additional risks, but that is not the issue now.


 Using PUA space or some unassigned space in Unicode to 
 represent invalid sequences present in a source text will be 
 a severe design error in all cases, because that conversion 
 will not be bijective and could map invalid sequences to 
 valid ones without further notice, changing the status of the 
 original text which should be kept as incorrectly encoded, 
 until explicitly corrected or until the source text is 
 reparsed with another more appropriate encoding.
Again, I am not changing the UTF-8 definition. In places where I do decide to interpret the 128 codepoints differently, it is my responsibility to understand the risks. If there is a risk, I can prevent it. If there is no risk, then I don't need to do anything. Thanks for the warning, but may I be allowed to decide whether it applies to me or not? Or will you insist that such codepoints should not be assigned to protect the innocent? Let's stop producing knives. They're dangerous.

 (In fact I also think that mapping invalid sequences to 
 U+FFFD is also an error, because U+FFFD is valid, and the 
 presence of the encoding error in the source is lost, and 
 will not throw exceptions in further processings of the 
 remapped text, unless the application constantly checks for 
 the presence of U+FFFD in the text stream, and all modules in 
 the application explicitly forbids U+FFFD within its interface...)
Generally, no, most definitely not. Your concern is ONLY valid in security related processing. In data processing, you must preserve the data. U+FFFD is a valid codepoint. A certain application may treat it as special, just as another might treat '/' as special. But you are almost suggesting that U+FFFD is invalid and should be signalled all over. When you realize that U+FFFD is just a codepoint, then you will also understand that codepoints for invalid sequences must also be codepoints. Valid codepoints.

I think my ideas are often misunderstood because I speak mainly of using these codepoints for preserving the invalid sequences. Leading to conclusion that I want to corrupt UTF-8. But that is not so. For one, this mechanism is not intended to replace neither decoding UTF-8, nor encoding UTF-8. It is to be used on interfaces that cannot guarantee pure UTF-8 data. And UTF-8 is just an example, one can use the replacement codepoints for preserving bytes in other encodings, for example a 0xA5 in Latin 3.


Lars





Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 And once we understand that things are manageable and not as
 frightening as it seems at first, then we can stop using this as an
 argument against introducing 128 codepoints. People who will find
 them useful should and will bother with the consequences. Others
 don't need to and can roundtrip them as today.

A person who is against them can't ignore a motion to introduce them,
because if they are introduced, other people / programs will start
feeding our programs arbitrary byte sequences labeled as UTF-8
expecting them to accept the data.

 So, interpreting the 128 codepoints as 'recreate the original byte
 sequence' is an option.

Which guarantees that different programs will have different view of
the validity and meaning of the same data labeled by the same encoding.
Long live standardization.

 Even I will do the same where I just want to represent Unicode in
 UTF-8. I will only use this conversion in certain places.

So it's not just different programs, but even the same program in
different places. Great...

 The fact that my conversion actually produces UTF-8 from most of
 Unicode points does not mean it produced UTF-8.

Increasing the number of encodings means more opportunities of
mislabeling and using wrong libraries to process data (as it works
in most of cases and thus the error is not detected immediately)
and harder life for programs which aim at supporting all data.

Think further than the immediate moment where many people are
performing a transition from something to UTF-8. Look what happened
with the interpretation of HTML in web browsers.

If the standard from the beginning stood firmly at disallowing
guessing what a malformed HTML was supposed to mean, then people
would learn how to produce correct HTML and the interpretation would
be unambiguous. But browsers tried to accept arbitrary contents and
interpret parts of HTML they found there, guessing how errors should
be resolved, being friendly to careless webmasters. The effect is
that too often they submitted a webpage after checking that it works
in their browser, but in fact it had basic syntax errors. Other
browsers interpreted the errors differently, and the page was
inaccessible or looked bad.

When designing XML, they learned from this mistake:
http://www.xml.com/axml/target.html#dt-fatal
http://www.xml.com/axml/notes/Draconian.html

That's why people here reject balkanization of UTF-8 by introducing
variations with subtle differences, like Java-modified UTF-8.

 Inaccessible filenames are something we shouldn't accept. All your
 discussion of non-empty empty directories is just approaching the problem
 from the wrong end. One should fix the root cause, not consequences.

The root cause is that users and programs use different encodings in
different places, and thus Unix filenames can't be unambiguously and
context-freely interpreted as character sequences.

Unfortunately it's hard to fix.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: RE: Roundtripping in Unicode

2004-12-13 Thread Philippe VERDY
Lars Kristan wrote: What I was talking about in the paragraph in question is what happens if you want to take unassigned codepoints and give them a new status.
You don't need to do that. No Unicode application must assign semantics to unassigned codepoints.
If a source sequence is invalid, and you want to preserve it, then this sequence must remain invalid if you change its encoding.
So there's no need for Unicode to assign valid code points for invalid source data.
There's enough space *assigned* as invalid (or assigned to non-characters) in all UTF forms to allow an application to create a local conversion scheme which will perform a bijective conversion of invalid sequences:
- for example in UTF-8: trailing bytes 0x80 to 0xBF isolated or in excess, or even the invalid lead bytes 0xF8 to 0xFF
- for example in UTF-16: 0xFFFE, 0xFFFF
- for example in UTF-32: same as UTF-16, plus all code units above 0x10FFFF
Using PUA space or some unassigned space in Unicode to represent invalid sequences present in a source text will be a severe design error in all cases, because that conversion will not be bijective and could map invalid sequences to valid ones without further notice, changing the status of the original text, which should be kept as incorrectly encoded until explicitly corrected or until the source text is reparsed with another more appropriate encoding.
(In fact I also think that mapping invalid sequences to U+FFFD is also an error, because U+FFFD is valid, and the presence of the encoding error in the source is lost, and will not throw exceptions in further processings of the remapped text, unless the application constantly checks for the presence of U+FFFD in the text stream, and all modules in the application explicitly forbid U+FFFD within its interface...)


Re: Roundtripping in Unicode

2004-12-13 Thread Mark Davis
Ken is absolutely right. It would be theoretically possible to add 128 code
points that would allow one to roundtrip a bytestream after passing through
a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add
2048 code points that would allow the same for a 16-bit data stream.)

However, these new code points would really be no better than private use
code points, since their interpretation would depend entirely on whatever
was assumed to be the interpretation of the original bytestream. If X
converted a bytestream that was assumed to be a mixture of 8858-7 with UTF-8
into Unicode with these new characters, and handed it off to Y, who
converted the bytestream back assuming that the odd bytes were to be
iso-8859-9, you would get data corruption. X and Y would have to agree on
the interpretation of these odd bytes to avoid that corruption, so it is
really no different than private use (where they also have to agree on the
interpretation).

Mark

- Original Message - 
From: Kenneth Whistler [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Monday, December 13, 2004 13:04
Subject: RE: Roundtripping in Unicode


 Lars Kristan stated:

  I said, the choice is yours. My proposal does not prevent you from doing
it
  your way. You don't need to change anything and it will still work the
way
  it worked before. OK? I just want 128 codepoints so I can make my own
  choice.

 You have them: U+EE80..U+EEFF, which are yours to use (or abuse)
 in an application as you see fit. Just don't expect others outside
 your application to interpret them as you do.

  And once and for all, you can treat those 128 codepoints just as you
  do today.

 A number of people on the list have patiently explained why what
 you are proposing to do fundamentally breaks UTF-8 and its
 relationship to other Unicode encoding forms.

 The chances that you will get the standard extended to incorporate
 these 128 code points and define their mapping to invalid byte
 values in UTF-8 is somewhere between zilch, nada, and nil.

 --Ken







RE: RE: RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: RE: RE: Roundtripping in Unicode





Philippe VERDY wrote:
 I don't think I miss the point. My suggested approach to 
 perform roundtrip conversions between UTF's while keeping all 
 invalid sequences as invalid (for the standard UTFs), is much 
 less risky than converting them to valid codepoints (and by 
 consequence to valid code units, because all valid code 
 points need valid code units in UTF encoding forms).


I still do think you are missing the point. About two years ago I started a similar thread. At that time I was pursuing the use of UTF-8B conversion, which uses one invalid sequence to represent another. It uses unpaired low surrogates. It works rather well, but one of the readers alerted me that I cannot expect that a Unicode database will be able (or, rather, willing) to process such data. Since I am not in a habit of writing every piece of the code myself (or by my team for that matter), I chose to use a third party database. The data that I have is mainly UTF-8, and users expect it to be interpreted as such. But are not expecting purism in the form of rejecting data (filenames) which contain invalid sequences. I am thankful to the person that pointed this out, and I have moved to using PUA. The rest of the responses were much like what I am getting now. Useless. Telling me to reject invalid sequences, telling me to rewrite everything and treat the data as binary. Or use an escaping technique, forgetting that everything they find wrong about the codepoint approach is also true for escaping. Except that escaping has a lot of overhead and that there is an actual risk of those escaping sequences being present in today's files. Not the ones on UNIX, but the ones on Windows. It should work both ways.

 
 The application doing that just preserves the original byte 
 sequences, for its internal needs, but will not expose to 
 other applications or modules such invalid sequences without 
 the same risks: these other modules need their own strategy, 
 and their strategy could simply be rejecting invalid 
 sequences, assuming that all other valid sequences are 
 encoding valid codepoints (this is the risk you take with 
 your proposal to assign valid codepoints to invalid byte 
 sequences in a UTF-8 stream, and a module that would 
 implement your proposal would remove important security features).
Only applications that do use the new conversion need to worry about security issues. And only those of course, that security issues apply to in the first place. All other applications can and should treat those codepoints as letters. And convert them to UTF-8 just as any other valid codepoint. I may have suggested otherwise at some point in time, but this is my current position.

 Note also that once your proposal is implemented, all valid 
 codepoints become convertible across all UTFs, without notice 
 (this is the principle of UTF that they allow transparent 
 conversions between each other).
Existing conversion is not modified. I am explaining how an alternate conversion works simply to prove it is useful. And it does not convert to UTF-8. It converts to byte sequences. And can be used in places where interfacing with such data. For example UNIX filenames. And 'supposedly UTF-8' is not the only case. The same technique can be used on 'supposedly Latin 3' data. The new conversions are used in pairs and existing UTF conversions remain as they are. Any security issues are up to whoever decides to use the new conversions. There are no security issues for those that do not.

 
 Suppose that your proposal is accepted, and that invalid 
 bytes 0xnn in UTF-8 sources (these bytes are necessarily 
 between 0x80 and 0xFF) get encoded to some valid code units 
 U+0mmmnn (in a new range U+mmm80 to U+mmmFF), then they 
 become immediately and transparently convertible to valid 
 UTF-16 or even valid UTF-8. Your assumption that the byte 
 sequence will be preserved will be wrong, because each 
 encoded binary byte will become valid sequences of 3 or 4 
 UTF-8 bytes (one lead byte in 0xE0..EF if code points are in 
 the BMP, or in 0xF0..0xF7 if they are in a supplementary 
 plane, and 2 or 3 trail bytes in 0x80..0xBF).
Again, a UTF-8 to UTF-16 converter does not need to (and should not) encode the invalid sequences as valid codepoints. Existing rules apply. Signal, reject, replace with U+FFFD.

 
 How do you think that other applications will treat these 
 sequences: they won't notice that they are originally 
 equivalent to the new valid sequences, and the byte sequence 
 itself would be transmitted across modules without any 
 warning (applications most often don't check whether 
 codepoints are assigned, just that they are valid and 
 properly encoded).
Exactly. This is why nothing breaks. And a Unicode application should treat the new codepoints exactly the same way it treats them today. Today they are unassigned and are converted according to existing rules. Once they are assigned, they just get some

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Kenneth Whistler wrote:
 Lars Kristan stated:
 
  I said, the choice is yours. My proposal does not prevent 
 you from doing it
  your way. You don't need to change anything and it will 
 still work the way
  it worked before. OK? I just want 128 codepoints so I can 
 make my own
  choice. 
 
 You have them: U+EE80..U+EEFF, which are yours to use (or abuse) 
 in an application as you see fit. Just don't expect others outside
 your application to interpret them as you do.


Well, I DO want someone to interpret them the way I do. And display them. And let them be entered. And not risk a clash with someone else, we are talking about PUA, right?

 
  And once and for all, you can treat those 128 codepoints just as you
  do today.
 
 A number of people on the list have patiently explained why what
 you are proposing to do fundamentally breaks UTF-8 and its
 relationship to other Unicode encoding forms.


It does not. I may have suggested at some point that the conversion from codepoints to UTF-8 should be changed. But I am no longer proposing that. The conversion to and from UTF-8 remains EXACTLY as it is today. I will use my own conversion as I see fit and deal with all the consequences. But I need 128 VALID codepoints. Not in PUA, not in any plane, but in BMP. And just because I say 'I' need, does not mean I am the only one.

One would judge who is right and who is not by the number of responses. But that is definitely not so. A couple of people keep responding and they have more or less the same theme. Which is because it has been rehearsed time and time again. I believe there are people who have long since realized that my claims are correct. But are just afraid to speak up. Also, wherever I win an argument, it is just dropped. In the end all that remains is a 'feeling' by a few people that 'this is not good'.

 
 The chances that you will get the standard extended to incorporate
 these 128 code points and define their mapping to invalid byte
 values in UTF-8 is somewhere between zilch, nada, and nil.


No, not UTF-8. UTF-8 remains as it is. What I will do with them is my business. I am only telling you about it so you cannot dismiss it as 'encapsulating arbitrary binary data in Unicode'.


Lars





RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





 Ken is absolutely right. It would be theoretically possible 
 to add 128 code
 points that would allow one to roundtrip a bytestream after 
 passing through
 a UTF-8 <=> UTF-32 conversion. (For that matter, it would be 
 possible to add
 2048 code points that would allow the same for a 16-bit data stream.)
You don't really need to add anything for 16-bit <=> UTF-32. There is no real-life need to have that roundtrip guaranteed. For 8-bit data there is real-life need. And even, for 16-bit <=> UTF-32 you can do it simply by defining how surrogates should be processed. Not saying it should be done, but showing it could be done. But for UTF-8 <=> UTF-32 it cannot be done without 128 new codepoints. Which is why I am often comparing these 128 codepoints to the surrogates. With one difference, they should be valid characters.
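What the 16-bit <=> UTF-32 case amounts to is easy to sketch (illustrative only, and deliberately not a sanctioned UTF: lone surrogates are simply passed through, which is exactly the "defining how surrogates should be processed" above):

#include <stdint.h>
#include <stddef.h>

/* 16-bit units -> 32-bit words: real surrogate pairs are combined,
 * lone surrogates are passed through as their own values.  The result is
 * NOT UTF-32, but every 16-bit stream survives the roundtrip.
 * dst needs room for at most n words. */
size_t words16_to_words32(const uint16_t *src, size_t n, uint32_t *dst)
{
    size_t i = 0, w = 0;
    while (i < n) {
        uint32_t u = src[i++];
        if (u >= 0xD800 && u <= 0xDBFF && i < n &&
            src[i] >= 0xDC00 && src[i] <= 0xDFFF)
            u = 0x10000 + ((u - 0xD800) << 10) + (src[i++] - 0xDC00);
        dst[w++] = u;
    }
    return w;
}

/* 32-bit words -> 16-bit units: the exact inverse of the above.
 * dst needs room for at most 2 * n units. */
size_t words32_to_words16(const uint32_t *src, size_t n, uint16_t *dst)
{
    size_t i, w = 0;
    for (i = 0; i < n; i++) {
        uint32_t u = src[i];
        if (u < 0x10000) {
            dst[w++] = (uint16_t)u;            /* includes lone surrogates, unchanged */
        } else {
            u -= 0x10000;
            dst[w++] = (uint16_t)(0xD800 + (u >> 10));
            dst[w++] = (uint16_t)(0xDC00 + (u & 0x3FF));
        }
    }
    return w;
}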

 
 However, these new code points would really be no better than 
 private use
 code points, since their interpretation would depend entirely 
Oh yes they would. Anyone might be using those same codepoints in PUA for something completely different.


 on whatever
 was assumed to be the interpretation of the original bytestream. If X
 converted a bytestream that was assumed to be a mixture of 
 8858-7 with UTF-8
 into Unicode with these new characters, and handed it off to Y, who
 converted the bytestream back assuming that the odd bytes were to be
 iso-8859-9, you would get data corruption. X and Y would have 
Nope. No data corruption. You just get the odd bytes back. And achieve exactly the same as if X passed the data directly to Y. Y doesn't convert from UTF-8 to iso-8859-9, nor does it convert the odd bytes to iso-8859-9. It converts UTF-8 to the original byte stream and ONLY THEN interprets it as iso-8859-9. So, the same as if it got the data directly.


Lars





Re: Roundtripping in Unicode

2004-12-13 Thread Arcane Jill
If I have understood this correctly, filenames are not in a locale, they 
are absolute. Users, on the other hand, are in a locale, and users view 
filenames. The same filename can look different to two different users. To 
user A (whose locale is Latin-1), a filename might look valid; to user B 
(whose locale is UTF-8), the same filename might look invalid.

Is that right, Lars?
If so, Marcin, what exactly is the error, and whose fault is it?
Jill
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Marcin 'Qrczak' Kowalczyk
Sent: 13 December 2004 14:59
To: [EMAIL PROTECTED]
Subject: Re: Roundtripping in Unicode
Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.




Re: RE: Roundtripping in Unicode

2004-12-13 Thread John Cowan
Doug Ewell scripsit:

 When faced with [an] ill-formed code unit sequence while transforming
 or interpreting text, a conformant process must treat the first code
 unit... as an illegally terminated code unit sequence -- for example, by
 signaling an error, filtering the code unit out, or representing the
 code unit with a marker such as U+FFFD REPLACEMENT CHARACTER.

Plan 9, the original all-UTF-8 environment (it was translated
in a single day from Latin-1 to UTF-8), represents ill-formed code unit
sequences with the otherwise useless U+0080, on the grounds that an
ill-formed code is semantically different from an untranslatable
character, which is the purpose of U+FFFD.

-- 
LEAR: Dost thou call me fool, boy?  John Cowan
FOOL: All thy other titles  http://www.ccil.org/~cowan
 thou hast given away:  [EMAIL PROTECTED]
  That thou wast born with. http://www.reutershealth.com



RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Peter Kirk wrote:


 Now no doubt many Unix filename handling utilities ignore the 
 fact that 
 some octets are invalid or uninterpretable in the locale, 
 because they 
 handle filenames as octet strings (with 0x00 and 0x2F having special 
 interpretations) rather than as locale-dependent character 
 strings. But 
 these routines should continue to work in a UTF-8 locale, as 
 they make 
 no attempt to interpret any octets other than 0x00 and 0x2F.


Hm, here lies the catch. According to UTC, you need to keep processing the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8 function is allowed to reject invalid sequences. Basically, you are not supposed to use strcpy to process filenames.

Well, I just hope no one will listen to them and modify strcpy and strchr to validate the data when running in UTF-8 locale and start signalling something (really, where and how?!). The two statements from UTC don't make sense when put together. Unless we are really expected to start building everything from scratch.


 All of this is ingenious, and may be useful for internal processing 
 within a Unix system, and perhaps even for interaction between 
 cooperating systems. But NOT-Unicode is not Unicode (!) and 
 so Unicode 
 should not be expected to standardise it.
Not by definition. But if it would help the users since it would simplify the transition, then why not?



Lars





RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 UTF-8 is painful to process in the first place. You are making it
 even harder by demanding that all functions which process UTF-8 do
 something sensible for bytes which don't form valid UTF-8. They even
 can't temporarily convert it to UTF-32 for internal processing for
 convenience.
My point exactly. I am proposing to provide a conversion so you can. All you need is to assign 128 codepoints and define their properties. They would be printable characters, non-spaces, would have no upper/lower case properties, would collate (for example) after all letters but before any special characters, and so on. Then you don't need to fix anything. Not in the functions. You just need to convert (and even convert from byte stream to UTF-8) on boundaries where you expect such data. And decide whether you need to prevent anything due to security reasons. If not, then you're done.

So, no, I am not demanding that UTF-8 functions need to behave differently. Existing functions work perfectly well, assuming you convert to UTF-8 (so, use three bytes to represent each invalid byte as a valid codepoint). It would be beneficial if they would, but that is a separate issue. It would need to be determined which functions could do so. Maybe all could, maybe only some could, maybe none should. It needs to be investigated before anything is changed. This is in line with what I said about validation. Processing functions may do validation implicitly. But this is not a requirement. Unless you make it so. But in my opinion, it is better to separate validation from processing. In that case you can even prescribe exactly what they should do with invalid data. And in this case they should do exactly what they would do if the data was converted to UTF-8 according to my conversion. But again, this is the next step, that needn't be done at all.
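As a concrete sketch of the "three bytes per invalid byte" conversion described here (illustrative only: the 128 codepoints are not assigned by Unicode, so the base value below is pure assumption, borrowed from the U+EE80..U+EEFF private-use range mentioned elsewhere in this thread):

#include <stdint.h>
#include <stddef.h>

/* HYPOTHETICAL: the 128 "raw byte" codepoints are not assigned by Unicode.
 * U+EE80..U+EEFF (private use) is used here only because it was mentioned
 * in this thread as the space that is already available today. */
#define RAW_BYTE_BASE 0xEE80u

/* Emit one byte from the 0x80..0xFF range (bytes below 0x80 are plain ASCII
 * and never need escaping) as the three-byte UTF-8 form of the corresponding
 * "raw byte" codepoint.  Returns the number of bytes written (always 3). */
size_t encode_raw_byte(unsigned char b, unsigned char dst[3])
{
    uint32_t cp = RAW_BYTE_BASE + (b - 0x80u);   /* e.g. 0xA9 -> U+EEA9 */
    dst[0] = (unsigned char)(0xE0 |  (cp >> 12));
    dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    dst[2] = (unsigned char)(0x80 |  (cp & 0x3F));
    return 3;
}

Any processing function that already handles valid UTF-8 then handles these three-byte sequences as ordinary characters; only code that chooses to map them back to raw bytes ever treats them specially.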

 
  Listing files in a directory should not signal anything. It MUST
  return all files and it should also return them in a way that this
  list can be used to access each of the files.
 
 Which implies that they can't be interpreted as UTF-8.
 
 By masking an error you are not encouraging users to fix it.
 Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
Failure to process such files is also an error. Think virus scanners and backup.


  The interesting thing is that if you do start using my conversion,
  you can actually get rid of the need to validate UTF-8 strings
  in the first scenario. That of course means you will allow users
  with invalid UTF-8 sequences, but if one determines that this is
  acceptable (or even desired), then it makes things easier. But the
  choice is yours.
 
 For me it's not acceptable, so I will not support declaring it valid.
I said, the choice is yours. My proposal does not prevent you from doing it your way. You don't need to change anything and it will still work the way it worked before. OK? I just want 128 codepoints so I can make my own choice. And once and for all, you can treat those 128 codepoints just as you do today.


Lars





Re: RE: RE: Roundtripping in Unicode

2004-12-13 Thread Philippe VERDY
 From : Lars Kristan 
 Philippe VERDY wrote: 
  If a source sequence is invalid, and you want to preserve it, 
  then this sequence must remain invalid if you change its encoding. 
  So there's no need for Unicode to assign valid code points 
  for invalid source data. 
 Using invalid UTF-16 sequences to represent invalid UTF-8 sequences is a 
 known approach (UTF-8B, if I remember correctly). But this is then not UTF-16 
 data so you don't gain much. The data is at risk of being rejected or filtered 
 out at any time. And that misses the whole point.

I don't think I miss the point. My suggested approach of performing roundtrip conversions between UTFs while keeping all invalid sequences invalid (for the standard UTFs) is much less risky than converting them to valid codepoints (and by consequence to valid code units, because all valid code points need valid code units in UTF encoding forms).

The application doing that just preserves the original byte sequences for its internal needs, but it cannot expose such invalid sequences to other applications or modules without the same risks: those other modules need their own strategy, and their strategy could simply be to reject invalid sequences, assuming that all valid sequences encode valid codepoints (this is the risk you take with your proposal to assign valid codepoints to invalid byte sequences in a UTF-8 stream, and a module that implemented your proposal would remove important security features).

Note also that once your proposal is implemented, all valid codepoints become 
convertible across all UTFs, without notice (this is the principle of UTF that 
they allow transparent conversions between each other).

Suppose that your proposal is accepted, and that invalid bytes 0xnn in UTF-8 
sources (these bytes are necessarily between 0x80 and 0xFF) get encoded to some 
valid code units U+0mmmnn (in a new range U+mmm80 to U+mmmFF), then they become 
immediately and transparently convertible to valid UTF-16 or even valid UTF-8. 
Your assumption that the byte sequence will be preserved will be wrong, because 
each encoded binary byte will become a valid sequence of 3 or 4 UTF-8 bytes (one 
lead byte in 0xE0..0xEF if code points are in the BMP, or in 0xF0..0xF7 if they 
are in a supplementary plane, and 2 or 3 trail bytes in 0x80..0xBF).

How do you think other applications will treat these sequences? They won't notice that they are originally equivalent to the new valid sequences, and the byte sequence itself would be transmitted across modules without any warning (applications most often don't check whether codepoints are assigned, just that they are valid and properly encoded).

Which application will take the responsibility of converting these valid 3-4 byte sequences back to invalid 1-byte sequences, given that your data will already be treated by them as valid, and already encoded with valid UTF code units or encoding schemes?

Come back to your filesystem problem. Suppose that there ARE filenames that already contain these valid 3-4 byte sequences. This hypothetical application will blindly convert the valid 3-4 byte sequences to invalid 1-byte sequences, and then won't be able to access these files, even though they were already correctly UTF-8 encoded. So your proposal breaks valid UTF-8 encoding of filenames. In addition it creates dangerous aliases that will redirect accesses from one filename to another (so yes, it is also a security problem).

My opinion is then that we must not allow the conversion of any invalid byte sequence to valid code points. All your application can do is convert them to invalid sequences of code units, to preserve the invalid status. Then it's up to that application to make this conversion privately and to restore the original byte sequence before communicating again with the external system. Another process or module can do the same if it wishes to, but none of them should communicate directly with each other using their private code unit sequences. The decision to accept invalid byte sequences must remain local to each module and is not transmissible.

This means that permanent files containing invalid byte sequences must not be converted to another UTF and replaced as long as they contain an invalid byte sequence. Such a file converter should fail, and warn the user about file contents or filenames that could not be converted. Then it's up to the user to decide whether to:
- drop these files
- use a filter to remove invalid sequences (if it's a filename, the filter may 
need to append some indexing string to keep filenames unique in a directory)
- use a filter to replace some invalid sequences with a user-specified valid 
substitution string
- use a filter that will automatically generate valid substitution strings.
- use other programs that will accept and will be able to process invalid files 
as opaque sequences of bytes instead of as a stream of Unicode characters.
- change 

RE: Roundtripping in Unicode

2004-12-13 Thread Kenneth Whistler
Lars Kristan stated:

 I said, the choice is yours. My proposal does not prevent you from doing it
 your way. You don't need to change anything and it will still work the way
 it worked before. OK? I just want 128 codepoints so I can make my own
 choice. 

You have them: U+EE80..U+EEFF, which are yours to use (or abuse) 
in an application as you see fit. Just don't expect others outside
your application to interpret them as you do.

 And once and for all, you can treat those 128 codepoints just as you
 do today.

A number of people on the list have patiently explained why what
you are proposing to do fundamentally breaks UTF-8 and its
relationship to other Unicode encoding forms.

The chances that you will get the standard extended to incorporate
these 128 code points and define their mapping to invalid byte
values in UTF-8 is somewhere between zilch, nada, and nil.

--Ken



Re: RE: Roundtripping in Unicode

2004-12-13 Thread Doug Ewell

Philippe VERDY wrote:

 (In fact I also think that mapping invalid sequences to U+FFFD is also
 an error, because U+FFFD is valid, and the presence of the encoding
 error in the source is lost, and will not throw exceptions in further
 processings of the remapped text, unless the application constantly
 checks for the presence of U+FFFD in the text stream, and all modules
 in the application explicitly forbids U+FFFD within its interface...)

Mapping invalid sequences to U+FFFD is explicitly permitted by
conformance clause C12a (TUS 4.0, p. 61):

When faced with [an] ill-formed code unit sequence while transforming
or interpreting text, a conformant process must treat the first code
unit... as an illegally terminated code unit sequence -- for example, by
signaling an error, filtering the code unit out, or representing the
code unit with a marker such as U+FFFD REPLACEMENT CHARACTER.

Of course, any subsequent process that handles this text would have to
understand this convention, and not choke if handed a U+FFFD.
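
As a rough sketch of one of the options that clause permits (consume a single byte per error and emit U+FFFD; other error policies are equally conformant), a decoding step might look like this in C:

    #include <stdint.h>
    #include <stddef.h>

    /* Decode one code point from in[0..len-1] (len > 0).  On an ill-formed
     * sequence, store U+FFFD and consume exactly one byte. */
    static size_t decode_one(const unsigned char *in, size_t len, uint32_t *cp)
    {
        unsigned char b = in[0];
        size_t need, i;
        uint32_t c, min;

        if (b < 0x80)      { *cp = b; return 1; }              /* ASCII */
        else if (b < 0xC2) goto bad;                           /* stray trail byte or overlong lead */
        else if (b < 0xE0) { need = 1; c = b & 0x1F; min = 0x80;    }
        else if (b < 0xF0) { need = 2; c = b & 0x0F; min = 0x800;   }
        else if (b < 0xF5) { need = 3; c = b & 0x07; min = 0x10000; }
        else goto bad;                                         /* 0xF5..0xFF: never valid leads */

        if (len < need + 1) goto bad;                          /* truncated sequence */
        for (i = 1; i <= need; i++) {
            if ((in[i] & 0xC0) != 0x80) goto bad;              /* missing trail byte */
            c = (c << 6) | (in[i] & 0x3F);
        }
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            goto bad;                                          /* overlong, out of range, surrogate */
        *cp = c;
        return need + 1;

    bad:
        *cp = 0xFFFD;                                          /* REPLACEMENT CHARACTER */
        return 1;
    }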

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Roundtripping in Unicode

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 Please make up your mind: either they are valid and programs are
 required to accept them, or they are invalid and programs are required
 to reject them.
 
 I don't know what they should be called. The fact is there shouldn't be any.
 And that current software should treat them as valid. So, they are not valid
 but cannot (and must not) be validated. As stupid as it sounds. I am sure
 one of the standardizers will find a Unicodally correct way of putting it.

I am sure they will not.

There is a tension to migrate from processing strings in terms of
bytes in some vaguely specified encoding to processing them in terms
of code points of a known encoding, or even further: combining
character sequences, graphemes etc.

20 years ago the distinction was moot: a byte was a character, except
for some specialized programs for handling CJK. Today, when Latin names
with accented characters mixed with Cyrillic names are not displayed
correctly or not sorted according to the lexicographic conventions of some
culture, the program can be considered broken. Unfortunately supporting
this requires changing the paradigm. A font with 256 characters and a
byte-based rendering engine is not enough for display, and for
sorting it's no longer enough to compare a byte at a time.

You are trying to stick with processing byte sequences, carefully
preserving the storage format instead of preserving the meaning in
terms of Unicode characters. This leads to less robust software
which is not certain about the encoding of the text it processes and
thus can't apply algorithms like case mapping without risking
meaningless damage to the text.

 Today, two invalid UTF-8 strings compare the same in UTF-16, after a
 valid conversion (using a single replacement char, U+FFFD) and they
 compare different in their original form,

Conversion should signal an error by default. Replacing errors by
U+FFFD should be done only when the data is processed purely for
showing it to the user, without any further processing, i.e. when it's
better to show the text partially even if we know that it's corrupted.

 Either you do everything in UTF-8, or everything in UTF-16. Not
 always, but typically. If comparisons are not always done in the
 same UTF, then you need to validate. And not validate while
 converting, but validate on its own. And now many designers will
 remember that they didn't. So, all UTF-8 programs (of that kind)
 will need to be fixed. Well, might as well adopt my broken
 conversion and fix all UTF-16 programs. Again, of that kind, not all
 in general, so there are few. And even those would not be all
 affected. It would depend on which conversion is used where. Things
 could be worked out. Even if we would start changing all the
 conversions. Even more so if a new conversion is added and only used
 when specifically requested.

I don't understand anything of this.

 I cannot afford not to access the files.

Then you have two choices:
- Don't use Unicode.
- Pretend that filenames are encoded in ISO-8859-1, and represent them
  as a sequence of code points U+0001..U+00FF. They will not be displayed
  correctly but the information will be preserved.
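
A sketch of that second option, with the usual caveat that the helper names are only illustrative:

    #include <stdint.h>
    #include <stddef.h>

    /* Treat filename bytes as ISO-8859-1: byte b becomes code point U+00b.
     * Lossless in both directions; the names just may not display meaningfully. */
    static void bytes_to_latin1_cps(const unsigned char *name, size_t len, uint32_t *out) {
        for (size_t i = 0; i < len; i++)
            out[i] = name[i];                      /* U+0001..U+00FF */
    }

    static int latin1_cps_to_bytes(const uint32_t *cps, size_t len, unsigned char *out) {
        for (size_t i = 0; i < len; i++) {
            if (cps[i] == 0 || cps[i] > 0xFF)
                return -1;                         /* not representable as a filename byte */
            out[i] = (unsigned char)cps[i];
        }
        return 0;
    }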

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 It's essential that any UTF-n can be translated to any other without
 loss of data. Because it allows to use an implementation of the given
 functionality which represents data in any form, not necessarily the
 form we have at hand, as long as correctness is concerned. Avoiding
 conversion should matter only for efficiency, not for correctness.
 
 When I am talking about roundtrip, I speak of arbitrary data, not
 just valid data.

You want to declare all byte sequences as valid. And thus valid data
is no longer preserved on round trip, because different UTFs are able
to encode different sequences of code points.

 Roundtrip for valid data is of course essential and needs to be
 preserved.

Your proposal does not do this.

 Unpaired surrogates are not valid UTF-16, and there are no surrogates
 in UTF-8 at all, so there is no point in trying to preserve UTF-16
 which is not really UTF-16.
 
 Actually, there is a point. It is just that you fail to understand it.
 But then, you needn't worry about it, since it is outside of your area
 of interest.

I would worry if my programs would no longer accept what Unicode
considers valid UTF-n. And I would worry if rules defined by Unicode
would make one code point encodable as UTF-n, and another encodable too, but the
sequence of the two not encodable (because UTF-n would no longer
be usable as a format for serialization of arbitrary strings of valid
code points).

I would also worry if an API, file format or network protocol intended
for use by various programs required a non-standard variant of UTF-n,
because I couldn't use standard UTF-n encoding and decoding functions
to interoperate with it.

I indeed don't worry in what way you abuse UTF-n, as long as it's not
an official Unicode standard and it's not widely used in practice.

 If UTC takes 128 unassigned codepoints and declares them to be a new
 set of surrogates, you needn't worry either (your valid data will
 still convert to any UTF).

No, because it would remove the responsibility not to generate such data
and add the responsibility to accept it, and thus some programs which
are not currently broken would be broken under the changed rules.

 Unless you have a strict validator which already validates unpaired
 surrogates. But you don't. I am pretty sure about it.

I use system-supplied iconv() which does not accept anything which can
be described as unpaired surrogates.

 If a user encounters corrupt data and cannot process it with your
 program, she (she is 'politically correct', but in this case can
 be seen as sexism) will blame it on the program, not the data.

I don't care.

 This has been discussed mails back. UNIX filenames are already 'submitted'.
 Once you set your locale to UTF-8, you have labelled them all as UTF-8.
 Suggestions?

Convert them to be valid UTF-8 (as long as locales used in the system
use UTF-8 as the encoding, that is, otherwise keep them in the locale's
encoding).

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 Lars Kristan [EMAIL PROTECTED] writes:
 
  The other name for this is roundtripping. Currently, Unicode allows
 a roundtrip UTF-16 => UTF-8 => UTF-16. For any data. But there are
 several reasons why a UTF-8 => UTF-16(32) => UTF-8 roundtrip is more
  valuable, even if it means that the other roundtrip is no longer
  guaranteed:
 
 It's essential that any UTF-n can be translated to any other without
 loss of data. Because it allows to use an implementation of the given
 functionality which represents data in any form, not necessarily the
 form we have at hand, as long as correctness is concerned. Avoiding
 conversion should matter only for efficiency, not for correctness.
When I am talking about roundtrip, I speak of arbitrary data, not just valid data. Roundtrip for valid data is of course essential and needs to be preserved.

 
  Let me go a bit further. A UTF-16 => UTF-8 => UTF-16 roundtrip is only
  required for valid codepoints other than the surrogates. But it also
  works for surrogates unless you explicitly and 
 intentionally break it.
 
 Unpaired surrogates are not valid UTF-16, and there are no surrogates
 in UTF-8 at all, so there is no point in trying to preserve UTF-16
 which is not really UTF-16.
Actually, there is a point. It is just that you fail to understand it. But then, you needn't worry about it, since it is outside of your area of interest. So, as far as you are concerned, I can do with surrogates anything I like, right? If UTC takes 128 unassigned codepoints and declares them to be a new set of surrogates, you needn't worry either (your valid data will still convert to any UTF). Unless you have a strict validator which already validates unpaired surrogates. But you don't. I am pretty sure about it.

 
  I would opt for the latter (i.e. keep it working), according to my
  statement (in the thread When to validate) that validation should
  be separated from other processing, where possible.
 
 Surely it should be separated: validation is only necessary when data
 are passed from the external world to our system. Internal operations
 should not produce invalid data from valid data. You don't have to
 check at each point whether data is valid. You can assume that it is
 always valid, as long as the combination of the programming language,
 libraries and the program is not broken.
 
 Some languages make it easier to ensure that strings are valid, to the
 point that they guarantee it (they don't offer any way to construct
 an invalid string). Unfortunately many languages don't: they say that
 they represent strings in UTF-8 or UTF-16, but they are unsafe, they
 do nothing to prevent constructing an array of words which is not
 valid UTF-8 or UTF-16 and passing it to functions which assume that
 it is. Blame these languages, not the definitions of UTF-n.
Blaming solves nothing. In this case it is just a philosophical exercise. If a user encounters corrupt data and cannot process it with your program, she (she is 'politically correct', but in this case can be seen as sexism) will blame it on the program, not the data. The fact that your program conforms to the Unicode standard doesn't help you. Another program that doesn't, might work. If the user chooses to use this other program instead of yours, who will you blame?

 
  All this is known and presents no problems, or - only problems that
  can be kept under control. So, by introducing another set of 128
  'surrogates', we don't get a new type of a problem, just another
  instance of a well known one.
 
 Nonsense. UTF-8, UTF-16 and UTF-32 are interchangeable, and you would
 like to break this. No way.
Not in a way you would need to worry about. Did UTF-16 break UCS-2? No, because the codepoints that were assigned to surrogates were not used before. Same thing here.

  On top of it, I repeatedly stressed that it is UTF-8 data 
 that has the
  highest probablility of any of the following:
  * contains portions that are not UTF-8
  * is not really UTF-8, but user has UTF-8 set as default encoding
  * is not really UTF-8, but was marked as such
  * a transmission error not only changes data but also 
 creates invalid
  sequences
 
 In these cases the data is broken and the damage should be signalled as
 soon as possible, so the submitter can know this and correct it.
This has been discussed mails back. UNIX filenames are already 'submitted'. Once you set your locale to UTF-8, you have labelled them all as UTF-8. Suggestions?


Lars





RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 Lars Kristan [EMAIL PROTECTED] writes:
 
  All assigned codepoints do roundtrip even in my concept.
  But unassigned codepoints are not valid data.
 
 Please make up your mind: either they are valid and programs are
 required to accept them, or they are invalid and programs are required
 to reject them.
I don't know what they should be called. The fact is there shouldn't be any. And that current software should treat them as valid. So, they are not valid but cannot (and must not) be validated. As stupid as it sounds. I am sure one of the standardizers will find a Unicodally correct way of putting it.

 
  Furthermore, I was proposing this concept to be used, but not
  unconditionally. So, you can, possibly even should, keep using
  whatever you are using.
 
 So you prefer to make programs misbehave in unpredictable ways
 (when they pass the data from a component which uses relaxed rules
 to a component which uses strict rules) rather than have a clear and
 unambiguous notion of a valid UTF-8?
I am not particularly thrilled about it. In fact it should be discussed. Constructively. Simply assuming everything will break is not helpful. But if you want an answer, yes, I would go for it. Actually, there are fewer concerns involved than people think. Security is definitely an issue. But again, one shouldn't assume it breaks just like that. Let me risk a bold statement: security is typically implicitly centralized. And if comparison is always done in the same UTF, it won't break. The simple fact that two different UTF-16 strings compare equal in UTF-8 (after relaxed conversion) does not introduce a security issue. Today, two invalid UTF-8 strings compare the same in UTF-16 after a valid conversion (using a single replacement char, U+FFFD), and they compare different in their original form if you use strcmp. But you probably don't. Either you do everything in UTF-8, or everything in UTF-16. Not always, but typically. If comparisons are not always done in the same UTF, then you need to validate. And not validate while converting, but validate on its own. And now many designers will remember that they didn't. So, all UTF-8 programs (of that kind) will need to be fixed. Well, might as well adopt my broken conversion and fix all UTF-16 programs. Again, of that kind, not all in general, so there are few. And even those would not all be affected. It would depend on which conversion is used where. Things could be worked out. Even if we were to start changing all the conversions. Even more so if a new conversion is added and only used when specifically requested.
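
A tiny illustration of that comparison point (the UTF-16 arrays are written out by hand here, assuming a lenient converter that maps each invalid byte to U+FFFD):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Two different byte strings: "x" followed by a different invalid byte. */
        const char a[] = { 'x', (char)0xC0, 0 };
        const char b[] = { 'x', (char)0xFF, 0 };
        printf("bytes differ: %d\n", strcmp(a, b) != 0);             /* prints 1 */

        /* After a lenient conversion that replaces each invalid byte with
         * U+FFFD, both become the same UTF-16 sequence { 0x0078, 0xFFFD }. */
        const unsigned short ua[] = { 0x0078, 0xFFFD };
        const unsigned short ub[] = { 0x0078, 0xFFFD };
        printf("utf16 equal: %d\n", memcmp(ua, ub, sizeof ua) == 0); /* prints 1 */
        return 0;
    }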

There is a cost and there are risks. Nothing should be done hastily. But let's go back and ask ourselves what the benefits are. And evaluate the whole.

 
  Perhaps I can convert mine, but I cannot convert all filenames on
  a user's system.
 
 Then you can't access his files.
Yes, this is where it all started. I cannot afford not to access the files. I am not writing a notepad.


 
 With your proposal you couldn't as well, because you don't make them
 valid unconditionally. Some programs would access them and some would
 break, and it's not clear what should be fixed: programs or filenames.
It is important to have a way to write programs that can. And, there is definitely nothing to be fixed about the filenames. They are there and nobody will bother to change them. It is the programs that need to be fixed. And if Unicode needs to be fixed to allow that, then that is what is supposed to happen. Eventually.

Lars





Re: Roundtripping in Unicode

2004-12-11 Thread Doug Ewell
Lars Kristan wrote:

 All assigned codepoints do roundtrip even in my concept.
 But unassigned codepoints are not valid data.

 Please make up your mind: either they are valid and programs are
 required to accept them, or they are invalid and programs are
 required to reject them.

 I don't know what they should be called. The fact is there shouldn't
 be any. And that current software should treat them as valid. So, they
 are not valid but cannot (and must not) be validated. As stupid as it
 sounds. I am sure one of the standardizers will find a Unicodally
 correct way of putting it.

I can't even understand that paragraph, let alone paraphrase it.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Lars Kristan wrote:
I am sure one of the standardizers will find a Unicodally
correct way of putting it.
I can't even understand that paragraph, let alone paraphrase it.
My understanding of his question, and my response to his problem, is that you 
MUST not use VALID Unicode codepoints to represent INVALID byte sequences 
found in some text with an alleged UTF encoding.

The only way is to use INVALID codepoints, out of the Unicode space, and 
then design an encoding scheme that contains and extends the Unicode UTF, 
and make sure that there will be no possible interaction between such 
encoded binary data and encoded plain text (so the conversion between the 
encoding scheme of the byte stream and the encoding form with code units or 
codepoints in memory must be fully bijective; it is hard to design if you 
have to also support multiple UTF encoding schemes, because the invalid byte 
sequences of these UTF schemes are not the same, and must then be 
represented with distinct invalid codepoints or code units for each external 
UTF!)

I won't support the idea of reserving some valid codepoint in the Unicode 
space to allow storing something which is already considered invalid 
character data, notably because the Unicode standard is evolving, and such 
private encoding form which would work now could become incompatible with a 
later version of the Unicode standard, or a later standardized Unicode 
encoding scheme, meaning that interoperability would be lost...

The only thing for which you have a guarantee that Unicode will not assign a 
mandatory behavior is the codepoint space beyond U+10FFFF (I'm not sure about 
the permanent invalidity of some code unit spaces in UTF-8 and UTF-16 
encoding forms; also I'm not sure that there will be enough free space in 
later standard encoding forms or schemes, see for example SCSU or BOCU-1, or 
with other already used private encoding forms like the modified UTF-8 
extended encoding scheme defined by Sun in Java).




Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
My view about this problem of roundtripping is that if data, supposed to contain 
only valid UTF-8 sequences, contains some invalid byte sequences that still need 
to be roundtripped to some code point for internal management, and roundtripped 
later back to the original invalid byte sequence, then these invalid bytes MUST 
NOT be converted to valid code points.

An implementation based on an internal UTF-32 code unit representation could 
use, privately only, the range which is NOT assigned to valid Unicode code 
points; such an application would need to convert these bytes into code points 
higher than 0x10FFFF, but the same application will no longer be conforming to 
strict UTF-32 requirements: the application will in this way represent binary 
data which is NOT bound to Unicode rules and which can't be valid plain text. 
For example, a fixed offset above 0x10FFFF plus n, where n is the byte value to 
encapsulate. Don't call it UTF-32, because it MUST remain for private use only!
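
A sketch of what such an internal-only escaping could look like; the offset 0x110000 below is merely one possible choice of values above the Unicode range, not something the message specifies:

    #include <stdint.h>

    /* Internal use only: never emit these values as UTF-32. */
    static uint32_t escape_invalid_byte(unsigned char b) {
        return 0x110000u + b;                      /* above U+10FFFF, never a real code point */
    }

    static int is_escaped_byte(uint32_t u) {
        return u >= 0x110000u && u <= 0x1100FFu;
    }

    static unsigned char unescape_byte(uint32_t u) {
        return (unsigned char)(u - 0x110000u);
    }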

This will be more complex if the application uses UTF-16 code units, because 
there are only TWO code units that can be used to recognize such 
invalid-text data within a text stream. It is possible to do that, but with 
MUCH care:
For example, encode 0xFFFE before each byte value converted to some 16-bit 
code unit. The problem is that backward parsing of strings just checks that a 
code unit is a low surrogate, to see if a second backward step is needed to 
get the first high surrogate, and so U+FFFE would need to be used (privately 
only) as another lead high surrogate with special (internal) meaning for 
round trip compatibility, and so the best choice for the code unit encoding 
the invalid byte value would be to use a standard low surrogate to store 
this byte. So a qualifying internal representation would be {0xFFFE, 
0xDC00+n} where n is the byte value to encapsulate.
Don't call this UTF-16, because it is not UTF-16.
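
In code, the two-unit escape described above might be sketched like this (internal use only, as the message stresses; the helper names are illustrative):

    #include <stdint.h>

    /* An invalid byte n becomes the internal pair { 0xFFFE, 0xDC00 + n }. */
    static void escape_byte_16(unsigned char n, uint16_t out[2]) {
        out[0] = 0xFFFE;                           /* private lead marker */
        out[1] = (uint16_t)(0xDC00 | n);           /* low-surrogate value carrying the byte */
    }

    static int is_escaped_pair(const uint16_t p[2]) {
        return p[0] == 0xFFFE && (p[1] & 0xFF00) == 0xDC00;
    }

    static unsigned char unescape_byte_16(const uint16_t p[2]) {
        return (unsigned char)(p[1] & 0xFF);
    }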

An implementation that uses UTF-8 for valid strings could use the invalid 
ranges for lead bytes to encapsulate invalid byte values. Note however that the 
invalid bytes you would need to represent have 256 possible values, but the 
UTF-8 lead bytes have only 2 reserved values (0xC0 and 0xC1), each for 64 
codes, if you want to use an encoding on two bytes. The alternative would be 
to use the UTF-8 lead byte values which were initially assigned to byte 
sequences longer than 4 bytes, and that are now unassigned/invalid in 
standard UTF-8. For example: {0xF8+(n/64); 0x80+(n%64)}.
Here also it will be a private encoding, that should NOT be named UTF-8, and 
the application should clearly document that it will not only accept any 
valid Unicode string, but also some invalid data which will have some 
roundtrip compatibility.
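
The two-byte escape above, sketched in C (again private use only; the helper names are illustrative):

    /* Byte n is stored as { 0xF8 + (n / 64), 0x80 + (n % 64) }, reusing lead
     * byte values 0xF8..0xFB that are invalid in standard UTF-8. */
    static void escape_byte_8(unsigned char n, unsigned char out[2]) {
        out[0] = (unsigned char)(0xF8 + (n / 64)); /* 0xF8..0xFB */
        out[1] = (unsigned char)(0x80 + (n % 64)); /* 0x80..0xBF */
    }

    static unsigned char unescape_byte_8(const unsigned char in[2]) {
        return (unsigned char)((in[0] - 0xF8) * 64 + (in[1] - 0x80));
    }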

So what is the problem: suppose that the application, internally, starts to 
generate strings containing any occurrences of such private sequences, then 
it will be possible for the application to generate on its output a byte 
stream that would NOT have roundtrip compatibility, back to the private 
representation. So roundtripping would only be guaranteed for streams 
converted FROM a UTF-8 source where some invalid sequences are present and must be 
preserved by the internal representation. So the transformation is not 
bijective as you would think, and this potentially creates lots of possible 
security issues.

So for such an application, it would be much more appropriate to use different 
datatypes and structures to represent either streams of binary bytes or 
streams of characters, and to recognize them independently. The need for a 
bijective representation means that the input stream will contain an 
encapsulation to indicate *exactly* whether the stream is text or binary.

If the application is a filesystem storing filenames and there's no place in 
the filesystem to encode whether a filename is binary or text, then you are left 
without any secure solution!

So the best thing you can do to secure your application is to REJECT/IGNORE 
all files whose names do not match the strict UTF-8 encoding rules that your 
application expects (everything will happen as if those files were not present, but 
this may still create security problems if an application that does not see 
any file in a directory wants to delete that directory, assuming it is 
empty... In that case the application must be ready to accept the presence 
of directories without any content, and must not depend on the presence of a 
directory to determine that it has some contents; anyway, on secured 
filesystems, such things can happen due to access restrictions completely 
unrelated to the encoding of filenames, and it is not unreasonable to 
prepare the application so that it behaves correctly when faced with 
inaccessible files or directories, so that it will also 
correctly handle the fact that the same filesystem may contain 
non-plain-text and inaccessible filenames).

Anyway, the exposed solutions above demonstrate 

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 The other name for this is roundtripping. Currently, Unicode allows
 a roundtrip UTF-16 => UTF-8 => UTF-16. For any data. But there are
 several reasons why a UTF-8 => UTF-16(32) => UTF-8 roundtrip is more
 valuable, even if it means that the other roundtrip is no longer
 guaranteed:

It's essential that any UTF-n can be translated to any other without
loss of data. Because it allows to use an implementation of the given
functionality which represents data in any form, not necessarily the
form we have at hand, as long as correctness is concerned. Avoiding
conversion should matter only for efficiency, not for correctness.

 Let me go a bit further. A UTF-16 => UTF-8 => UTF-16 roundtrip is only
 required for valid codepoints other than the surrogates. But it also
 works for surrogates unless you explicitly and intentionally break it.

Unpaired surrogates are not valid UTF-16, and there are no surrogates
in UTF-8 at all, so there is no point in trying to preserve UTF-16
which is not really UTF-16.

 I would opt for the latter (i.e. keep it working), according to my
 statement (in the thread When to validate) that validation should
 be separated from other processing, where possible.

Surely it should be separated: validation is only necessary when data
are passed from the external world to our system. Internal operations
should not produce invalid data from valid data. You don't have to
check at each point whether data is valid. You can assume that it is
always valid, as long as the combination of the programming language,
libraries and the program is not broken.

Some languages make it easier to ensure that strings are valid, to the
point that they guarantee it (they don't offer any way to construct
an invalid string). Unfortunately many languages don't: they say that
they represent strings in UTF-8 or UTF-16, but they are unsafe, they
do nothing to prevent constructing an array of words which is not
valid UTF-8 or UTF-16 and passing it to functions which assume that
it is. Blame these languages, not the definitions of UTF-n.

 A UTF-32 => UTF-8 => UTF-32 roundtrip is similar, except that 16-8-16 works even
 with concatenation, while 32-8-32 can be broken with concatenation.

It always works as long as the data was really UTF-32 in the first place.
A word with a value of 0xD800 is not UTF-32.

 All this is known and presents no problems, or - only problems that
 can be kept under control. So, by introducing another set of 128
 'surrogates', we don't get a new type of a problem, just another
 instance of a well known one.

Nonsense. UTF-8, UTF-16 and UTF-32 are interchangeable, and you would
like to break this. No way.

 On the other hand, UTF-8 => UTF-16 => UTF-8 as well as UTF-8 => UTF-32 => UTF-8
 can be both achieved, with no exceptions. This is something no other
 roundtrip can offer at the moment.

But they do! An isolated byte with the highest bit set is not UTF-8,
so there is no point in converting it to UTF-16 and back.

 On top of it, I repeatedly stressed that it is UTF-8 data that has the
 highest probablility of any of the following:
 * contains portions that are not UTF-8
 * is not really UTF-8, but user has UTF-8 set as default encoding
 * is not really UTF-8, but was marked as such
 * a transmission error not only changes data but also creates invalid
 sequences

In these cases the data is broken and the damage should be signalled as
soon as possible, so the submitter can know this and correct it.

Alternatively you keep the original byte sequence, but don't pretend
that it's UTF-8. Delete the erroneous UTF-8 label instead of changing
the data.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:


  Roundtrip for valid data is of course essential and needs to be
  preserved.
 
 Your proposal does not do this.
All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data. Perhaps it should be stated using some other words, but there shouldn't be any in your data. Right?

Furthermore, I was proposing this concept to be used, but not unconditionally. So, you can, possibly even should, keep using whatever you are using.

 
  If a user encounters corrupt data and cannot process it with your
  program, she (she is 'politically correct', but in this case can
  be seen as sexism) will blame it on the program, not the data.
 
 I don't care.
If you don't, then the guy trying to sell your program will. Eventually, you will, too.


 
  This has been discussed mails back. UNIX filenames are 
 already 'submitted'.
  Once you set your locale to UTF-8, you have labelled them 
 all as UTF-8.
  Suggestions?
 
 Convert them to be valid UTF-8 (as long as locales used in the system
 use UTF-8 as the encoding, that is, otherwise keep them in 
 the locale's
 encoding).
Perhaps I can convert mine, but I cannot convert all filenames on a user's system. Other suggestions?



Lars





Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 All assigned codepoints do roundtrip even in my concept.
 But unassigned codepoints are not valid data.

Please make up your mind: either they are valid and programs are
required to accept them, or they are invalid and programs are required
to reject them.

 Furthermore, I was proposing this concept to be used, but not
 unconditionally. So, you can, possibly even should, keep using
 whatever you are using.

So you prefer to make programs misbehave in unpredictable ways
(when they pass the data from a component which uses relaxed rules
to a component which uses strict rules) rather than have a clear and
unambiguous notion of a valid UTF-8?

 Perhaps I can convert mine, but I cannot convert all filenames on
 a user's system.

Then you can't access his files.

With your proposal you couldn't as well, because you don't make them
valid unconditionally. Some programs would access them and some would
break, and it's not clear what should be fixed: programs or filenames.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/