RE: Roundtripping in Unicode

2004-12-16 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 Yes, IMHO all general-purpose languages should support processing
 arrays of bytes, in addition to Unicode strings.


C is likely to retain the behavior of the str functions. That, however, puts a lot of burden on the developers to identify all opaque strings and really handle them with those functions throughout the application (or, even worse, throughout a suite of applications not necessarily written by the same company).

Newer languages are probably often designed with an assumption that all you need is a good class for Unicode strings. Instead of making them change that assumption, we could consider finding a way to make that true.

If a solution that doesn't break anything in Unicode cannot be found, then consider a solution that does break something, but check what the broken part really affects. For example, we assume it MUST be possible to represent a valid Unicode string in any UTF stream and get it back. Suppose you find a solution that retains that capability for all Unicode codepoints except for 128 of them. If you know that those 128 will ONLY be used for a particular purpose, you might be willing to accept that those who use those codepoints will deal with the problem, while for those who don't, the rules didn't really change. What I am saying is that we need to preserve the intention of the existing rules, not the rules themselves.

But again, this is if I was proposing that everybody starts using my conversion everywhere. Which at this point I am not.

 
 It's not clear however how the API of filenames should look like,
 especially if they wish to be portable to Windows.


I intend to bring up the issue in the near future, and will try to let everyone catch some breath before that.



 or delimit the filename with \0, or prefix it with
 the length, or something like this.


I don't see why that would be necessary or useful.



 A backup software should do this
 and not pay attention to the locale. But for end-user software like
 an image viewer, processing arbitrary filenames is less important.


You have to pay attention to the locale eventually. You need to report which file failed to be backed up (or is infected with a virus). And you should be able to let the user restore a single file. If you don't interpret it according to the locale (possibly UTF-8), the user won't know how to select what she wants. It is even worse if one wants to enter the filename manually. All this CAN be done within the application, but it is very cumbersome. It gets worse if you want to pass some information to other software, since the other application may not have an interface to accept the opaque strings. If it does, the convention may differ. This is why I am saying that something should be standardized. Of course, standardizing a poor solution is not a good idea. We should do our best to find a good one.


 Technically they are binary (command line arguments must not contain
 zero bytes). Users are expecting stdin and stdout to be treated as
 text or binary depending on the program, while command line arguments
 are generally interpreted as text or filenames.


So, an application outputting filenames has a binary stdout, and no text application is guaranteed to process this output.


Lars





RE: Roundtripping in Unicode

2004-12-15 Thread D. Starner
Arcane Jill writes:

 The obvious solution is for all Unix machines everywhere to be using the same 
 locale - and it 
 had better be UTF-8. But an instantaneous global switch-over is never going 
 to happen, so we see 
 this gradual switch-over ... and it is during this transition phase that 
 Lars's problem 
 manifests. 

The only solution is (a) to use ASCII or (b) to make the switch over as quick 
and clean as possible. Anyone who wants to create new files in UTF-8 and leave 
their old files in the old encoding is asking for trouble. There's no magic
bullet, and complaining here as much as you want won't help. If you're a
system administrator, explain that to the people using your system, and
treat stupid responses just like you would any LART-worthy response.






RE: Roundtripping in Unicode

2004-12-15 Thread Arcane Jill
-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Philippe Verdy
Sent: 14 December 2004 22:47
To: Marcin 'Qrczak' Kowalczyk
Cc: [EMAIL PROTECTED]
Subject: Re: Roundtripping in Unicode
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Arcane Jill [EMAIL PROTECTED] writes:
If so, Marcin, what exactly is the error, and whose fault is it?
It's an error to use locales with different encodings on the same
system.
I confess I don't know much about Unix, but still, I'm not sure your 
assertion (Marcin) makes sense. Unix is a multi-user system. If you log on 
as User A, then User B's settings are hidden from you, unless User B has 
explicitly decided to share them. It may even be possible that there may be 
users of whose existence you are not even aware. Unix makes it possible for 
/you/ to change /your/ locale - but by your reasoning, this is an error, 
unless all other users do so simultaneously. Your reasoning implies that no 
Unix user should ever change their locale unless they have an absolute 
guarantee that all other users are going to do so simultaneously ... but I 
don't know if you can ever get such a guarantee. Or maybe you're saying that 
the error lies with Unix itself. Maybe that's a fair comment, but I gather 
Unix was invented before Unicode, so it can hardly be blamed for breaking 
Unicode's conceptual model.

But it goes beyond that. Copy a file onto a floppy disc and then physically 
take that floppy disc to a different Unix machine and log on as guest and 
insert the disc ... Will the filename look the same? It would seem that the "same system" is effectively every Unix machine on the planet, since files may be interchanged between them.

The obvious solution is for all Unix machines everywhere to be using the 
same locale - and it had better be UTF-8. But an instantaneous global 
switch-over is never going to happen, so we see this gradual switch-over ... 
and it is during this transition phase that Lars's problem manifests.

Philippe adds...
More simply, I think that it's an error to have the encoding part of any
locale...
which again attaches blame to Unix itself. All very "not my problem", but I think Lars has found that it actually /is/ his problem. (Not that I support his solution).

The system should not depend on them, and for critical things like
filesystem volumes, the encoding should be forced by the filesystem itself,
and applications should mandatorily follow the filesystem rules.
Of course, you are not /really/ suggesting that the Unix kernel be rewritten. But it's hard for me to see how else this could be achieved.

Now think about the web itself: it's really a filesystem, with billions
users, or trillion applications using simultaneously hundreds or thousands
of incompatible encodings... Many resources on the web seem to have valid
URLs for some users but not for others, until URLs are made independent of any user locale, and then not considered as encoded plain-text but only as strings of bytes.
Oh yeah - and that too. Well spotted.
Jill



RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk replied:
 Arcane Jill [EMAIL PROTECTED] writes:
 
  If so, Marcin, what exactly is the error, and whose fault is it?
 
 It's an error to use locales with different encodings on the same
 system.


Um, and whose fault is it?


You can advise the users against it, but they won't necessarily listen.


Switching to UTF-8 on UNIX opens two possibilities:


1 - Users that HAD different encodings on the same system will now only have one, namely UTF-8.
2 - Users that didn't have different encodings now may end up with different (and quite incompatible) encodings.


Assuming everything will happen quickly, and on all systems, is ... well, ignorant.


Once it happens, offending filenames should be rare. One could creep in for various reasons, not limited to malicious attempts.

Automated or assisted upgrades to UTF-8 have been mentioned. For those that will be able to use them, great. I would even go a step further: I would incorporate a switch into UNIX filesystems that would enable a validator. This validator would reject invalid UTF-8 filenames (along with some other characters), preventing them from being created in the first place. This is quite un-UNIX-like, but then so is UTF-8. Perhaps then we can declare UNIX filenames to be text. Well, for the most part. Except for some applications that WILL still need to be able to access all files, even on systems whose users decide (perhaps for valid reasons!) not to enable that validator.
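
A minimal sketch of what such a validator might check, written in C. The function name and the exact policy (here also rejecting ASCII control characters) are illustrative assumptions on my part; nothing below is mandated by the Unicode standard or by POSIX, and a real filesystem hook would live in the kernel with a configurable policy.

#include <stddef.h>

/* Illustrative sketch only: returns 1 if the byte string is well-formed
 * UTF-8 and contains no ASCII control characters, 0 otherwise.  Overlong
 * forms, surrogates and values above U+10FFFF are rejected. */
static int utf8_filename_ok(const unsigned char *s)
{
    while (*s) {
        unsigned c = *s;
        size_t len;
        unsigned long cp, min;
        if (c < 0x20 || c == 0x7F)
            return 0;                        /* control characters */
        if (c < 0x80) { s++; continue; }     /* plain ASCII */
        if      ((c & 0xE0) == 0xC0) { len = 2; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { len = 3; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { len = 4; cp = c & 0x07; min = 0x10000; }
        else return 0;                       /* stray continuation or bad lead byte */
        for (size_t i = 1; i < len; i++) {
            if ((s[i] & 0xC0) != 0x80)
                return 0;                    /* truncated or bad continuation byte */
            cp = (cp << 6) | (s[i] & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                        /* overlong, out of range, surrogate */
        s += len;
    }
    return 1;
}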


Lars





RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Oops, correction:


In response to Marcin 'Qrczak' Kowalczyk
 Question: should a new programming language which uses Unicode for 
 string representation allow non-characters in strings? Argument for 
 allowing them: otherwise they are completely useless, except 
 U+FFFE for BOM detection. Argument for disallowing them: they make 
 UTF-n inappropriate for serialization of arbitrary strings, and thus 
 non-standard extensions of UTF-n must be used for serialization. 


I wrote:


My opinion: 
 It should allow them and process them usefully. Furthermore, this 
 'usefully' should not be up to developers to discover. It should be 
 researched, described, well, in the end even standardized. IMHO, UTC 
 should consider leading this process, even if it does not end with 
 anything standardized in the Unicode standard.

 Validation should be completely separated from processing. IMHO. 


I wasn't paying attention to what Marcin wrote, namely the term non-characters.
What I wrote goes for invalid sequences and surrogates.


Lars





RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





D. Starner wrote:
 The only solution is (a) to use ASCII or (b) to make the 
 switch over as quick 
 and clean as possible. Anyone who wants to create new files 
 in UTF-8 and leave 
 their old files in the old encoding is asking for trouble. 
 There's no magic
 bullet, and complaining here as much as you want won't help. 
 If you're a
 system administrator, explain that to the people using your 
 system, and
 treat stupid responses just like you would any LART-worthy response.


A lone IT guy in a small company is not really in a position to take that stance. His user is also his boss. And it gets more complicated when thousands of systems in a network are involved. And what if the guys in the IT department realize the risks, and know they will be blamed for any inconvenience? Perhaps they will decide that the switch to UTF-8 is not really needed in their company. Though some users will start using UTF-8 on their own. And come complaining about the problems. And IT will again try to balance what to do. Except now it's even worse, since not all filenames are in Latin 1. And so on.

Lars





RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Kenneth Whistler wrote:
 Lars said:
 
  According to UTC, you need to keep processing
  the UNIX filenames as BINARY data. And, also according to 
 UTC, any UTF-8
  function is allowed to reject invalid sequences. Basically, 
 you are not
  supposed to use strcpy to process filenames.
 
 This is a very misleading set of statements.
Perhaps deliberately so.


 
 First of all, the UTC has not taken *any* position on the
 processing of UNIX filenames.
At this point, I won't make any statement about whether UTC should or need not do that.
Let me just ask if it is appropriate to discuss such issues on this list?


 
 It is erroneous to imply that the UTC has indicated that you
 are not supposed to use strcpy to process filenames.
As long as explanations about validation aren't misinterpreted by some people. Is there a thorough explanation of where and how to apply validation anywhere in the standard?

 
 Any process *interpreting* a UTF-8 code unit sequence as
 characters can and should recognize invalid sequences, but
 that is a different matter.
OK, strcpy does not need to interpret UTF-8. But strchr probably should. Or, is it that strchr is for opaque strings and mbschr is for UTF-8 strings? Then strchr should remain as is and be used for processing filenames. Hopefully, you do not need to search for Unicode characters in it and strchr-ing for '/' is all you need. But then all languages are supposed to provide functions for processing opaque strings in addition to their Unicode functions. Or, alternatively, they need to carefully define how string functions should process invalid sequences. If that can be done at all.
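
As an aside, here is a small C sketch of why strchr-ing for '/' stays safe on opaque byte strings: in UTF-8, every byte of a multibyte sequence is 0x80 or above, so the ASCII byte 0x2F can only ever be the path separator. The filename below is made up and deliberately contains an invalid byte; this is only an illustration, not anything prescribed by a standard.

#include <stdio.h>
#include <string.h>

/* Sketch: '/' (0x2F) never occurs inside a UTF-8 multibyte sequence, because
 * lead and continuation bytes are all >= 0x80.  So searching the raw bytes
 * with strchr finds path separators correctly even when the rest of the
 * name is not valid UTF-8 (the 0xC2 below is a lone, invalid byte). */
int main(void)
{
    const char *path = "b\xC2" "d/file.txt";
    const char *slash = strchr(path, '/');
    if (slash != NULL)
        printf("directory part is %d bytes long\n", (int)(slash - path));
    return 0;
}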

But sooner or later you need to incorporate the filename in some UTF-8 text. An error report, for example. You then need to program the boundaries quite carefully.

Not to mention the cost to maintain existing programs. I think it makes sense to keep looking for other solutions.


 
 If I pass the byte stream 0x80 0xFF 0x80 0xFF 0x80 0xFF to
 a process claiming conformance to UTF-8 and ask it to interpret
 that as Unicode characters, it should tell me that it is
 garbage. *How* it tells me that it is garbage is a matter of
 API design, code design, and application design.


What are stdin, stdout and argv (command line parameters) when a process is running in a UTF-8 locale? Binary? Opaque strings? UTF-8?

 Unicode did not invent the notion of conformance to character
 encoding standards. What is new about Unicode is that it has
 *3* interoperable character encoding forms, not just one, and
 all of them are unusual in some way, because they are designed
 for a very, very large encoded character repertoire, and
 involve multibyte and/or non-byte code unit representations.


The difference is that far more people will be faced with such problems.



Lars





RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 If one application switches from standard UTF-8 to your modification,
 and another application continues to use standard UTF-8, then the
 ability to pass arbitrary Unicode strings between them by serializing
 them to UTF-8 is lost. So you can't claim that does not affect
 programs which don't adopt it. It would have to be adopted by all
 programs which currently use UTF-8, or data exchange would break.


I don't think so. If I produce UTF-8 data from filenames, and give it to an UTF-8 application, nothing can be lost in the portion of this architecture that deals with Unicode data. Now, if you expect that you can give me Unicode data and I should store it in a filesystem (as a filename), then you're in error. It is definitely true that you can create a sequence of valid Unicode characters from my range and I will not be able to give it back. But I will also have to reject any '/' characters you feed me. You are misusing my application.

If some application chooses to use my conversion and loses or misinterprets your data, then it is broken and shouldn't use that conversion, or should not declare that particular interface as a Unicode interface.

 But it's not a viable replacement of UTF-8. Even if both applications
 use your modification, the ability to serialize arbitrary sequences
 of valid code points (i.e. not surrogates) through UTF-8 is lost: the
 mapping to modified UTF-8 is not injective.
Yes, that is true. But there are people who would be willing to accept that since it only happens if those 128 codepoints are used. Those can use the conversion, others needn't.

OK, there is one problem that I *do* see with the use of my conversion. I map a file from UX to Win. You then use not my application, but another one, which copies the file back from Win to UX (and that is easier, so you *can* use this application). Now the invalid sequence is already escaped. If I map this new file to Win again, I need to escape the escape. They can start piling up.

Of course, once you realize the problem, you can simply rename the file: you can undo the over-escaping (no data is ever lost!), and probably rename the file to valid UTF-8, which is what you want anyway. And you can do it even from the Windows system. If you prevent my solution, you will not have my program in the first place, meaning you will need to go to the UNIX system to rename the file, and even just to access it in the first place.

Actually, there are two subflavors of my conversion possible (I can hear you say "oh, no"). One does escape the escapes, the other doesn't. This second flavor can be used by applications that need to make UTF-8 from arbitrary input, but do not need to re-create the original byte sequence. Basically, they are preserving all the data, except for the information about how many times the original invalid sequences were escaped. There may be a need for such applications, and they would in fact reduce the re-escaping problem.


 Which means that UTF-8 can't be replaced with your modification.
 If they coexisted, expect trouble when the two slightly incompatible
 encodings meet.
Or, expect trouble when dealing with data that is not guaranteed to be UTF-8. Or hope that there will be no such data in the near future, and I mean none.


  Using my conversion, Windows can access any file on UNIX, because my
 conversion guarantees roundtrip UX => Win => UX
 
 Well, with or without your conversion it's not true, because there
 are various characters which are valid in Unix filenames but not in
 Windows (e.g. ? * : \ and control characters). So if all filenames are
 to be accessible, they have to introduce some escaping. And as soon
 as an escaping scheme is used, it can be extended to encode isolated
 bytes with high bit set.
Good point. But you are assuming I copy the files to a Windows filesystem. I don't. I have no problems if you specify your filename with any of the above characters, even from Windows.

And, BTW, suppose UTF-8 validation is introduced (as an option) on UNIX filesystems. The characters you mention (and some others; I can tell you exactly which don't work on Windows) could again be (optionally) rejected on UNIX filesystems.

 Win => UX => Win roundtrip is not guaranteed.
 
 Currently it breaks only for isolated surrogates (assuming the Unix
 is configured to use UTF-8). If Windows filenames are specified to be
 UTF-16, the error is clearly on the Windows side and this side should
 be fixed.
And in my case, it would break for some malicious sequences of the 128 codepoints. Equally rare, and with equally minor consequences. Um, and it can be fixed, too. Such malicious sequences could be forbidden in contexts where we fear they might cause problems.


Lars





Re: Roundtripping in Unicode

2004-12-15 Thread Mark Davis
 Nope. No data corruption. You just get the odd bytes back. And achieve

I see more of what you are trying to do; let me try to be more clear.
Suppose that the conversion is defined in the following way, between Unicode
strings (D29a-d, page 74) and UTFs using your proposed new characters, for
now with private use code points U+E080..U+E0FF.

U8 -> UTF-32. To convert a Unicode 8-bit string to UTF-32:
1. Set the pointer to the start
2. If the sequence starting at the pointer is a valid UTF-8 sequence
(checking of course to make sure it doesn't go off the end of the string),
convert it and emit.
3. Otherwise take the byte B following the pointer, and emit [E000 + B].

- Note that because all single bytes 00..7F are valid UTF-8, #3 doesn't get invoked on anything but 80..FF.
4. Advance the pointer past what was used and repeat until done


UTF-32 -> U8. To convert a UTF-32 string to a Unicode 8-bit string:
1. Set the pointer to the start
2. If the code point C at the pointer is from E080 to E0FF, emit a single
byte, [C - E000]
3. Otherwise convert to the UTF-8 sequence and emit.
4. Advance the pointer past what was used and repeat until done
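
In code, the two procedures might look like the following C sketch. decode_utf8 is a helper of my own for step 2, the output buffers are assumed to be large enough, and U+E080..U+E0FF are the private use points named above; this is only an illustration of the algorithm as described, not a reference implementation.

#include <stddef.h>
#include <stdint.h>

/* Helper for step 2: if p starts a valid UTF-8 sequence (correct
 * continuations, no overlongs, no surrogates, at most U+10FFFF), store the
 * code point in *cp and return its length in bytes; otherwise return 0. */
static size_t decode_utf8(const unsigned char *p, size_t n, uint32_t *cp)
{
    uint32_t c = p[0], min;
    size_t len;
    if (c < 0x80) { *cp = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; c &= 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { len = 3; c &= 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { len = 4; c &= 0x07; min = 0x10000; }
    else return 0;
    if (n < len) return 0;                        /* would run off the end */
    for (size_t i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (p[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return 0;
    *cp = c;
    return len;
}

/* U8 -> UTF-32: valid UTF-8 is decoded normally; every other byte B
 * (necessarily 0x80..0xFF) is emitted as the escape code point E000 + B. */
size_t u8_to_utf32(const unsigned char *in, size_t n, uint32_t *out)
{
    size_t i = 0, o = 0;
    while (i < n) {
        uint32_t cp;
        size_t len = decode_utf8(in + i, n - i, &cp);
        if (len) { out[o++] = cp;              i += len; }
        else     { out[o++] = 0xE000 + in[i];  i += 1;   }
    }
    return o;
}

/* UTF-32 -> U8: escape code points E080..E0FF become the single byte they
 * stand for; everything else is encoded as ordinary UTF-8. */
size_t utf32_to_u8(const uint32_t *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t c = in[i];
        if (c >= 0xE080 && c <= 0xE0FF) {
            out[o++] = (unsigned char)(c - 0xE000);
        } else if (c < 0x80) {
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (c >> 18));
            out[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        }
    }
    return o;
}

Applying u8_to_utf32 followed by utf32_to_u8 returns the original bytes; the reverse composition does not, which is exactly the asymmetry shown in the examples below.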


Taking any byte string, it would roundtrip when applying U8 -> UTF-32 then UTF-32 -> U8. However, the reverse would not be true; UTF-32 strings would not roundtrip through U8. For example,

start with UTF-32: 00A0 E0C2 E0A0
applying UTF-32 -> U8, goes to: C2 A0 C2 A0
applying U8 -> UTF-32, goes to: 00A0 00A0

Of course, a UTF-32 -> UTF-8 transformation would preserve these code points

 00A0 E0C2 E0A0 => C2 A0 EE 83 82 EE 82 A0

so it would behave differently than the UTF-32 -> U8 conversion.


Of course, one could apply this process between the Unicode bit strings and
UTFs of other widths. And the same thing applies; one direction would
roundtrip and the other wouldn't.

start with UTF-8: C2 A0 EE 83 82 EE 82 A0
applying UTF-8 -> U8, goes to: C2 A0 C2 A0
applying U8 -> UTF-8, goes to: C2 A0 C2 A0


(I realize that some of this may duplicate what others have said -- I
haven't had the time to follow this thread in any detail.)

Mark

- Original Message - 
From: Lars Kristan
To: 'Mark Davis' ; Kenneth Whistler
Cc: [EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 03:30
Subject: RE: Roundtripping in Unicode


 Ken is absolutely right. It would be theoretically possible
 to add 128 code
 points that would allow one to roundtrip a bytestream after
 passing through
 a UTF-8 => UTF-32 conversion. (For that matter, it would be
 possible to add
 2048 code points that would allow the same for a 16-bit data stream.)
You don't really need to add anything for 16-bit => UTF-32. There is no
real-life need to have that roundtrip guaranteed. For 8-bit data there is
a real-life need. And even, for 16-bit => UTF-32 you can do it simply by
defining how surrogates should be processed. Not saying it should be done,
but showing it could be done. But for UTF-8 => UTF-32 it cannot be done
without 128 new codepoints. Which is why I am often comparing these 128
codepoints to the surrogates. With one difference: they should be valid
characters.

 However, these new code points would really be no better than
 private use
 code points, since their interpretation would depend entirely
Oh yes they would. Anyone might be using those same codepoints in PUA for
something completely different.
 on whatever
 was assumed to be the interpretation of the original bytestream. If X
 converted a bytestream that was assumed to be a mixture of
 8859-7 with UTF-8
 into Unicode with these new characters, and handed it off to Y, who
 converted the bytestream back assuming that the odd bytes were to be
 iso-8859-9, you would get data corruption. X and Y would have
Nope. No data corruption. You just get the odd bytes back. And achieve
exactly the same as if X passed the data directly to Y. Y doesn't convert
from UTF-8 to iso-8859-9, nor does it convert the odd bytes to iso-8859-9.
It converts UTF-8 to the original byte stream and ONLY THEN interprets it
as iso-8859-9. So, the same as if it got the data directly.


Lars




RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode






Marcin 'Qrczak' Kowalczyk wrote:
 But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
 NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
 awkward way which would happen to exclude those subsequences of
 non-characters which would form a valid UTF-8 fragment.
NOT-UTF-16 -> NOT-UTF-8 -> NOT-UTF-16 was never a goal. Nor was UTF-16 -> NOT-UTF-8 -> UTF-16, or NOT-UTF-16 -> UTF-8 -> NOT-UTF-16.

UTF-16 -> UTF-8 -> UTF-16 is preserved and that keeps the goals of UTF intact.


The goal, BTW, is: NOT-UTF-8 -> UTF-16 -> NOT-UTF-8.


 Question: should a new programming language which uses Unicode for
 string representation allow non-characters in strings? Argument for
 allowing them: otherwise they are completely useless, except
 U+FFFE for BOM detection. Argument for disallowing them: they make
 UTF-n inappropriate for serialization of arbitrary strings, and thus
 non-standard extensions of UTF-n must be used for serialization.
My opinion:
It should allow them and process them usefully. Furthermore, this 'usefully' should not be up to developers to discover. It should be researched, described, well, in the end even standardized. IMHO, UTC should consider leading this process, even if it does not end with anything standardized in the Unicode standard.

Validation should be completely separated from processing. IMHO.



Lars





Re: Roundtripping in Unicode

2004-12-15 Thread Peter Kirk
On 15/12/2004 00:22, Mike Ayers wrote:
 From: Peter Kirk [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, December 14, 2004 3:37 PM
 Thanks for the clarification. Perhaps the bifurcation could
 be better expressed as into strings of characters as defined
 by the locale and strings of non-null octets. Then I could
 re-express this as the only safe way out of this mess is
 never to process filenames as strings of characters as
 defined by the locale.
That would not be correct for ISO 8859 locales, though 
(amongst others).  That's why I specified UTF-8.  Although other 
locales may have the problem of invalid sequences, we're only 
interested in UTF-8 here.

But surely octets 0x80 to 0x9f are (at least mostly) invalid in ISO 
8859? While some applications may choose to process these invalid characters as if they were valid, displaying them as boxes or not at all (and this is a security risk), others, especially those concerned with security, do in fact treat them as errors, in one way or another. 
For example, Marcin noted for Mozilla:

If a filename ... can be
converted but contains characters like 0x80-0x9F in ISO-8859-2,
they are displayed as question marks and the file is inaccessible.
It should be treated as a general issue with ALL locales and character 
sets (with perhaps just a few exceptions) that not all sequences of 
octets represent valid character strings. UTF-8 is by no means a special 
case here.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 Unix makes it possible for /you/ to change /your/ locale - but by
 your reasoning, this is an error, unless all other users do so
 simultaneously.

Not necessarily: you can change the locale as long as it uses the same
default encoding.

By "error" I mean a bad idea. The system does not prevent you from
changing the locale to a different encoding. But then you are on your
own and various things can break: terminal output will be mangled, you
can't enter characters used in a different encoding from the keyboard,
text files will be illegible, and Unicode programs which process texts
may reject your data or even filenames. If you still need to change
encodings, it's safer to use ASCII-only filenames.

This situation is temporary. Well, it may last 10 more years or so,
but it will probably gradually improve:

First, more protocols and file formats are becoming aware of character
encodings and either label them explicitly or use a known encoding
(generally some Unicode encoding scheme). Especially protocols for
data interchange over Internet: WWW, email, usenet, modern instant
messaging protocols like Jabber. Some old protocols remain
encoding-ignorant, e.g. irc and finger. GNOME 1 used the locale
encoding, GNOME 2 uses UTF-8. Copying and pasting text in X window now
has a separate API which uses UTF-8. While the irc protocol doesn't
specify the encoding, the irssi client can now recode texts itself
to conform to customs of particular channels.

Second, UTF-8 is becoming more usable as the default encoding
specified by the locale. I don't use it now because too many things
still break, but it's improving: there are things which didn't work
just a few years ago and work now. Terminal emulators in X widely
support UTF-8 mode now. The curses library now has a working wide
character API. Emacs and vi work in UTF-8 (Emacs still has problems).
Readline now works in UTF-8. Localized messages (gettext) are now
recoded automatically.

Other programs still don't work. Bash works, while zsh and ksh don't.
Most full-screen text programs use the narrow character curses API and
don't work in UTF-8. Brokenness of interactive interpreters of various
languages varies.

BTW, in the wide character curses API (the only way curses can work
in a UTF-8 terminal), characters are expressed as sequences of wchar_t
(base char + some combining chars, possibly double width). Which means
that you must somehow translate filenames to this representation
in order to display them - same as with a Unicode-based GUI. It's
meaningless to render arbitrary bytes on the terminal, and you can't
force curses to emit the original byte sequences which form filenames
(which would be a bad idea for control characters anyway). By
legitimizing non-UTF-8 filenames in a UTF-8 system you increase the
problems such applications have to overcome: not only do they have to
show control characters somehow, but also invalid UTF-8.

 But it goes beyond that. Copy a file onto a floppy disc and then
 physically take that floppy disc to a different Unix machine and log
 on as guest and insert the disc ... Will the filename look the same?

Depends on the filesystem and the way it is mounted.

For example if it's FAT with long filenames (which I think is the
usual format for floppies even on Unix), filenames can be recoded by
the kernel: you specify the encoding to present filenames in and the
encoding of short names. I don't know what happens with filenames
which are not expressible in the selected encoding.

In this way filenames may automatically convert between systems which
use different default encodings, preserving the character semantics
rather than the byte representation. Of course file contents will not
be converted.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode






Philippe Verdy wrote:
 I have not 
 found a solution to this problem, and I don't know if such 
 solution even 
 exists; if such solution exists, it should be quite complex...).


I think it should be possible to mathematically prove that it doesn't exist.


So, I claim you cannot achieve NOT-UTF-8 => UTF-16 => NOT-UTF-8 and UTF-16 => NOT-UTF-8 => UTF-16 at the same time. But this is not really needed, since none of this affects any UTF trip (and none of the above is one).

And, the funny thing is - currently NOT2-UTF-16 => NOT2-UTF-8 => NOT2-UTF-16 *is* possible (NOT2, because it is not the same conversion; it is actually a UCS-2 conversion). But there is no need for it. NOT-UTF-8 => UTF-16 => NOT-UTF-8 is THE most valuable one. Outside of Unicode, that is. Unicode could acknowledge that fact and yield 128 codepoints.


Lars





RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Arcane Jill wrote:
 The obvious solution is for all Unix machines everywhere to 
 be using the 
 same locale - and it had better be UTF-8. But an instantaneous global 
 switch-over is never going to happen, so we see this gradual 
 switch-over ... 
 and it is during this transition phase that Lars's problem manifests.
Yes, some may not experience it, some will experience it for a day, some for a month, some for a year, some indefinitely.

And unless filesystems prevent invalid sequences from being added, it will keep happening to everybody. And if only very seldom, then it will be even harder to find a person who can fix it.

 Of course, you are not /really/ suggesting that the Unix kernel be rewritten. But it's hard for me to see how else this could be achieved.


What one might pursue is to make the UNIX filesystem invariant, so Windows-like. In that scenario, a filesystem stores Unicode strings and adjusts the representation of filenames according to the user's locale. But there are two reasons against it:

A - If only the filesystem does it, then whenever you switch the locale, all references to files in other files break. Unless you treat the files in the same manner, which is what Windows does if an application is not Unicode (with a number of associated problems on top). But that is not what is supposed to be done on UNIX.

B - As we move to UTF-8, there will be less and less need to use different locales. So why bother with enabling the system to represent UTF-8 in any other locale if that locale will not even be used anymore? Concerns about the transition period do apply, but then you end up with two transitions, which is even less appealing.


So, the only perceivable option is to start thinking about validation in the filesystem. If and when one chooses to enable it. But keep in mind that it will only reduce the problem. Not all programs will be able to rely on it (like virus scanners, HSM, backup, ...).


Lars





Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 OK, strcpy does not need to interpret UTF-8. But strchr probably should.

No. Its argument is a byte, even though it's passed as type int.
By byte here I mean C char value, which is an octet in virtually
all modern C implementations; the C standard doesn't guarantee this
but POSIX does.

Many C functions are not suitable for processing UTF-8, or are
suitable only as long as we consider all non-ASCII characters opaque
bags of bytes. For example isalpha takes a byte, toupper transforms
a byte to a byte, and strncpy copies up to n bytes even if it's
in the middle of a UTF-8 character.

There are wide character versions like iswalpha and towupper. But then
data must be converted from a sequence of char to a sequence of wchar_t.
Standard and semi-standard functions which do this conversion for UTF-8
reject invalid UTF-8 (they all have a means of reporting errors).
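
For illustration, a small C sketch of that behaviour using the standard mbrtowc(); the locale name is an assumption that varies between systems, and the filename is made up. On an invalid sequence mbrtowc returns (size_t)-1 and sets errno to EILSEQ, which is the rejection being described.

#include <errno.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Sketch: converting a byte string to wide characters with mbrtowc() in a
 * UTF-8 locale.  The lone 0xC2 byte is not a complete UTF-8 sequence, so
 * the conversion stops with an error; there is no standard way to carry
 * such a byte through into the wchar_t string. */
int main(void)
{
    if (setlocale(LC_CTYPE, "en_US.UTF-8") == NULL)  /* locale name varies */
        return 1;
    const char *name = "b\xC2" "d.txt";              /* invalid UTF-8 */
    const char *p = name;
    mbstate_t st;
    memset(&st, 0, sizeof st);
    wchar_t wc;
    size_t r;
    while ((r = mbrtowc(&wc, p, strlen(p), &st)) != 0) {
        if (r == (size_t)-1 || r == (size_t)-2) {
            perror("mbrtowc");                       /* reports EILSEQ here */
            return 1;
        }
        printf("converted U+%04lX\n", (unsigned long)wc);
        p += r;
    }
    return 0;
}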

The assumption that wchar_t has something to do with Unicode is not as
common as about char and bytes. I don't know whether FreeBSD finally
changed their wchar_t to Unicode. And it can be UTF-32 (Unix) or
UTF-16 (Windows).

 But then all languages are supposed to provide functions for
 processing opaque strings in addition to their Unicode functions.

Yes, IMHO all general-purpose languages should support processing
arrays of bytes, in addition to Unicode strings.

It's not clear however how the API of filenames should look like,
especially if they wish to be portable to Windows.

 But sooner or later you need to incorporate the filename in some
 UTF-8 text. An error report, for example.

While it's not clear what a well-behaved application should do by
default, in order to be 100% robust and preserve all information
you must change the usual conventions anyway. Remember that any byte
except \0 and / is valid in a filename, so you must either escape
some characters, or delimit the filename with \0, or prefix it with
the length, or something like this. A backup software should do this
and not pay attention to the locale. But for end-user software like
an image viewer, processing arbitrary filenames is less important.
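
A minimal C sketch of the '\0'-delimited convention mentioned above; the helper and the file names are illustrative only (the same convention is what find -print0 and xargs -0 use).

#include <stdio.h>
#include <string.h>

/* Sketch: emitting filenames delimited by '\0' rather than '\n', so that a
 * consumer can recover any byte sequence a filename may contain. */
static void emit_filename(FILE *out, const char *name)
{
    fwrite(name, 1, strlen(name) + 1, out);   /* include the terminating '\0' */
}

int main(void)
{
    emit_filename(stdout, "ordinary.txt");
    emit_filename(stdout, "b\xC2" "d.txt");   /* not valid UTF-8, still safe */
    return 0;
}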

 What are stdin, stdout and argv (command line parameters) when a
 process is running in a UTF-8 locale?

Technically they are binary (command line arguments must not contain
zero bytes). Users are expecting stdin and stdout to be treated as
text or binary depending on the program, while command line arguments
are generally interpreted as text or filenames.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


RE: Roundtripping in Unicode

2004-12-15 Thread Mike Ayers
Title: RE: Roundtripping in Unicode






 From: Peter Kirk [mailto:[EMAIL PROTECTED]] 
 Sent: Wednesday, December 15, 2004 3:52 AM


 But surely octets 0x80 to 0x9f are (at least mostly) invalid 
 in ISO 8859?


 They are in fact valid. However, because they are control characters, they are not considered displayable.


 While some applications may choose to process 
 these invalid characters as if they were valid, but display 
 them as boxes or not at all (and this is a security risk), 
 others and especially those concerned with security do in 
 fact treat them as errors, in one way or another. 
 For example, Marcin noted for Mozilla:
 
 If a filename ... can be
 converted but contains characters like 0x80-0x9F in ISO-8859-2, they 
 are displayed as question marks and the file is inaccessible.


 This is a good policy and is what Lars should consider. It places the responsibility for the filename where it belongs: on the file's creator.

 It should be treated as a general issue with ALL locales and 
 character sets (with perhaps just a few exceptions) that not 
 all sequences of octets represent valid character strings. 
 UTF-8 is by no means a special case here.


 Exactly. Which underscores just how silly these threads are.



/|/|ike




"Tumbleweed E-mail Firewall tumbleweed.com" made the following
 annotations on 12/15/04 09:50:11
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


RE: Roundtripping in Unicode

2004-12-14 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Arcane Jill wrote:
 I've been following this thread for a while, and I've pretty 
Thanks for bearing with me. And I hope my response will not discourage you from continuing to do so. That is, until I am banned from the list for heresy.

 much got the 
 hang of the issues here. To summarize:
 
 Unix filenames consist of an arbitrary sequence of octets, 
 excluding 0x00 
 and 0x2F. How they are /displayed/ to any given user depends 
 on that user's 
 locale setting. In this scenario, two users with different 
 locale settings 
 will see different filenames for the same file, but they will 
 still be able 
 to access the file via the filename that they see. These two 
 filenames will 
 be spelt identically in terms of octets, but (apparently) 
 differently when 
 viewed in terms of characters.
 
 At least, that's how it was until the UTF-8 locale came along. If we 
I think such problems were already present with Shift-JIS. But I already stated once why this was not noticed, and will not repeat myself unless explicitly asked to do so.

 consider only one-byte-per-character encodings, then any 
 octet sequence is 
 valid in any locale. But UTF-8 introduces the possibility 
 that an octet 
 sequence might be invalid - a new concept for Unix. So if 
 you change your 
 locale to UTF-8, then suddenly, some files created by other 
 users might 
 appear to you to have invalid filenames (though they would 
 still appear 
 valid when viewed by the file's creator).
 
 A specific example: if a file F is accessed by two different 
 users, A and B, 
 of whom A has set their locale to Latin-1, and B has set 
 their locale to 
 UTF-8, then the filename may appear to be valid to user A, 
 but invalid to 
 user B.
 
 Lars is saying (and he's probably right, because he knows 
 more about Unix 
 than I) that user B does not necessarily have the right to 
 change the actual 
 octet sequence which is the filename of F, just to make it 
 appear valid to 
 user B, because doing so would stop a lot of things working 
 for user A (for 
 instance, A might have created the file, the filename might 
 be hardcoded in 
 a script, etc.). So Lars takes a Unix-like approach, saying 
 retain the 
 actual octet sequence, but feel free to try to display and 
 manipulate it as 
 if it were some UTF-8-like encoding in which all octet 
 sequences are valid. 
 And all this seems to work fine for him, until he tries to 
 roundtrip to 
 UTF-16 and back.
 
 I'm not sure why anyone's arguing about this though - 
 Phillipe's suggestion 
 seems to be the perfect solution which keeps everyone happy. So...
Well, it doesn't. The rest of my comments will show you why.


 
 ...allow me to construct a specific example of what Phillipe 
 suggested only 
 generally:
 
 DEFINITION - NOT-Unicode is the character repertoire 
 consisting of the 
 whole of Unicode, and 128 additional characters representing 
 integers in the 
 range 0x80 to 0xFF.
As long as we agree that the codepoints used to store the NOT-Unicode data are valid Unicode codepoints. You noticed yourself that NOT-Unicode should roundtrip through UTF-16. Only valid Unicode codepoints can be safely passed through UTF-16.

 
 OBSERVATION - Unicode is a subset of NOT-Unicode
But unfortunately data can pass from NOT-Unicode to Unicode. Some people think that this is terribly bad. One would think that storing NOT-UTF-8 in NOT-UTF-16 would prevent data from crossing the boundary, but that is not so.

 
 DEFINITION - NOT-UTF-8 is a bidirectional encoding between 
 a NOT-Unicode 
 character stream and an octet stream, defined as follows: if 
 a NOT-Unicode 
 character is a Unicode character then its encoding is the 
 UTF-8 encoding of 
 that character; else the NOT-Unicode character must represent 
 an integer, in 
 which case its encoding is itself. To decode, assume the next 
 NOT-Unicode 
 character is a Unicode character and attempt to decode from 
 the octet stream 
 using UTF-8; if this fails then the NOT-Unicode character is 
 an integer, in 
 which case read one single octet from the stream and return it.
More or less. You have not defined how to return the octet. It must be returned as a valid Unicode codepoint. And if a Unicode character is decoded, one must check whether it is any of the codepoints used for this purpose and escape it. But only when decoding NOT-UTF-8. Decoding from UTF-8 remains unchanged.

 
 OBSERVATION - All possible octet sequences are valid NOT-UTF-8.
Yes, that's the sanity check, because this is what we wanted to get.


 
 OBSERVATION - NOT-Unicode characters which are Unicode 
 characters will be 
 encoded identically in UTF-8 and NOT-UTF-8
Unfortunately not so. Because you started with the wrong assumption that NOT-UTF-8 data will not be stored in valid codepoints. But the fact that this observation is not true is not really a problem.

 
 OBSERVATION - NOT-Unicode characters which are not Unicode 
 characters cannot 
 be represented

Re: Roundtripping in Unicode

2004-12-14 Thread John Cowan
Peter Kirk scripsit:

 I think the problem here is that a Unix filename is a string of octets, 
 not of characters. And so it should not be converted into another 
 encoding form as if it is characters; it should be processed at a quite 
 different level of interpretation.

Unfortunately, that is simply a counsel of perfection.

Unix filenames are in general input as character strings, output as character
strings, and intended to be perceived as character strings.  The corner
cases in which this does not work are not sufficient to overthrow the
power and generality to be achieved by assuming it 99% of the time.

(A private correspondent has come up with an ingenious trick which
depends on being able to create files named 0x08 and 0x7F, but it
truly is a trick, and in any case depends only on an ASCII interpretation.)

-- 
Income tax, if I may be pardoned for saying so, John Cowan
is a tax on income.  --Lord Macnaghten (1901)   [EMAIL PROTECTED]



Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
 NOT-UTF-16 -> NOT-UTF-8

But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
awkward way which would happen to exclude those subsequences of
non-characters which would form a valid UTF-8 fragment.

Unicode has the following property. Consider sequences of valid
Unicode characters: from the range U+0000..U+10FFFF, excluding
non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10, and
U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
in any UTF-n, and nothing else is expected from UTF-n.

With the exception of the set of non-characters being irregular and
IMHO too large (why exclude U+FDD0..U+FDEF?!), and a weird top
limit caused by UTF-16, this gives a precise and unambiguous set of
values for which encoders and decoders are supposed to work. Well,
except for the non-obvious treatment of a BOM (at which level should
it be stripped? does this include UTF-8?).
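
For concreteness, the set just described can be written as a predicate; here is a small C sketch (the function name is my own, not anything from the standard).

#include <stdint.h>

/* Sketch of the set described above: code points that may appear in a
 * sequence of valid Unicode characters, i.e. within range, not a surrogate,
 * and not one of the noncharacters (U+FDD0..U+FDEF, plus U+nFFFE and
 * U+nFFFF for each plane n = 0..0x10). */
static int is_valid_unicode_character(uint32_t cp)
{
    if (cp > 0x10FFFF)                 return 0;  /* beyond the limit imposed by UTF-16 */
    if (cp >= 0xD800 && cp <= 0xDFFF)  return 0;  /* surrogates */
    if (cp >= 0xFDD0 && cp <= 0xFDEF)  return 0;  /* noncharacter block */
    if ((cp & 0xFFFE) == 0xFFFE)       return 0;  /* U+nFFFE / U+nFFFF in every plane */
    return 1;
}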

A variant of UTF-8 which includes all byte sequences yields a much
less regular set of abstract string values. Especially if we consider
that 11101111 10111111 10111110 binary (0xEF 0xBF 0xBE, the UTF-8 form
of U+FFFE) is not valid UTF-8, as much as 0xFFFE is not valid UTF-16
(it's a reversed BOM; it must be invalid in order for a BOM to fulfill
its role).

Question: should a new programming language which uses Unicode for
string representation allow non-characters in strings? Argument for
allowing them: otherwise they are completely useless, except
U+FFFE for BOM detection. Argument for disallowing them: they make
UTF-n inappropriate for serialization of arbitrary strings, and thus
non-standard extensions of UTF-n must be used for serialization.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 Hm, here lies the catch. According to UTC, you need to keep
 processing the UNIX filenames as BINARY data. And, also according
 to UTC, any UTF-8 function is allowed to reject invalid sequences.
 Basically, you are not supposed to use strcpy to process filenames.

No: strcpy passes raw bytes, it does not interpret them according to
the locale. It's not an UTF-8 function.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-14 Thread Peter Kirk
On 14/12/2004 11:32, Arcane Jill wrote:
I've been following this thread for a while, and I've pretty much got 
the hang of the issues here. To summarize:

I haven't followed everything, but here is my 2 cents worth.
I note that there is a real problem. I have had significant problems in 
Windows with files copied from other language systems. Sometimes for 
example these files are listed fine in Explorer but when I try to copy 
or delete them they are not found, presumably because the filename is 
being corrupted somewhere in the system and doesn't match.

Unix filenames consist of an arbitrary sequence of octets, excluding 
0x00 and 0x2F. How they are /displayed/ to any given user depends on 
that user's locale setting. In this scenario, two users with different 
locale settings will see different filenames for the same file, but 
they will still be able to access the file via the filename that they 
see. These two filenames will be spelt identically in terms of octets, 
but (apparently) differently when viewed in terms of characters.

At least, that's how it was until the UTF-8 locale came along. If we 
consider only one-byte-per-character encodings, then any octet 
sequence is valid in any locale. But UTF-8 introduces the 
possibility that an octet sequence might be invalid - a new concept 
for Unix. So if you change your locale to UTF-8, then suddenly, some 
files created by other users might appear to you to have invalid 
filenames (though they would still appear valid when viewed by the 
file's creator).

This is not in fact a new concept. Some octet sequences which are valid 
filenames are invalid in a Latin-1 locale - for example, those which 
include octets in the range 0x80-0x9F, if Latin-1 means ISO 8859-1. 
Some of these octets are of course defined in Windows CP1252 etc, so a 
Unix Latin-1 system may have some interpretation for some of them; but 
others e.g. 0x81 have no interpretation in any flavour of Latin-1 as far 
as I know. So there is by no means a guarantee that every non-Unicode 
Unix locale has an interpretation of every octet, which implies that 
other octets are invalid.

Now no doubt many Unix filename handling utilities ignore the fact that 
some octets are invalid or uninterpretable in the locale, because they 
handle filenames as octet strings (with 0x00 and 0x2F having special 
interpretations) rather than as locale-dependent character strings. But 
these routines should continue to work in a UTF-8 locale, as they make 
no attempt to interpret any octets other than 0x00 and 0x2F.

A specific example: if a file F is accessed by two different users, A 
and B, of whom A has set their locale to Latin-1, and B has set their 
locale to UTF-8, then the filename may appear to be valid to user A, 
but invalid to user B.

Lars is saying (and he's probably right, because he knows more about 
Unix than I) that user B does not necessarily have the right to change 
the actual octet sequence which is the filename of F, just to make it 
appear valid to user B, because doing so would stop a lot of things 
working for user A (for instance, A might have created the file, the 
filename might be hardcoded in a script, etc.). So Lars takes a 
Unix-like approach, saying retain the actual octet sequence, but feel 
free to try to display and manipulate it as if it were some UTF-8-like 
encoding in which all octet sequences are valid. And all this seems 
to work fine for him, until he tries to roundtrip to UTF-16 and back.

I think the problem here is that a Unix filename is a string of octets, 
not of characters. And so it should not be converted into another 
encoding form as if it is characters; it should be processed at a quite 
different level of interpretation.

Of course a system is free to do what it wants internally.
I'm not sure why anyone's arguing about this though - Phillipe's 
suggestion seems to be the perfect solution which keeps everyone 
happy. So...

...allow me to construct a specific example of what Phillipe suggested 
only generally:

...
This would appear to solve Lars' problem, and because the three 
encodings, NOT-UTF-8, NOT-UTF-16 and NOT-UTF-32, don't claim to be 
UTFs, no-one need get upset.

All of this is ingenious, and may be useful for internal processing 
within a Unix system, and perhaps even for interaction between 
cooperating systems. But NOT-Unicode is not Unicode (!) and so Unicode 
should not be expected to standardise it.

I can see that there may be a need for a protocol for open exchange of 
Unix-like filenames. But these filenames should be treated as binary 
data (which may or may not be interpretable in any one locale) and 
encoded as such, rather than forced into the mould of Unicode characters 
which it does not fit.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



RE: Roundtripping in Unicode

2004-12-14 Thread Kenneth Whistler
Lars said:

 According to UTC, you need to keep processing
 the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8
 function is allowed to reject invalid sequences. Basically, you are not
 supposed to use strcpy to process filenames.

This is a very misleading set of statements.

First of all, the UTC has not taken *any* position on the
processing of UNIX filenames. That is an implementation issue
outside the scope of what the UTC normally deals with, and I
doubt that it will take a position on the issue.

It is erroneous to imply that the UTC has indicated that you
are not supposed to use strcpy to process filenames. It has
done nothing of the kind, and I don't know of any reason why
anyone should think otherwise. I certainly use strcpy to process
filenames, UTF-8 or not, and expect that nearly every implementer
on the list has done so, too.

Any process *interpreting* a UTF-8 code unit sequence as
characters can and should recognize invalid sequences, but
that is a different matter.

If I pass the byte stream 0x80 0xFF 0x80 0xFF 0x80 0xFF to
a process claiming conformance to UTF-8 and ask it to interpret
that as Unicode characters, it should tell me that it is
garbage. *How* it tells me that it is garbage is a matter of
API design, code design, and application design.

But there is *nothing* new here.

If I pass the byte stream 0x80 0xFF 0x80 0xFF 0x80 0xFF to
a process claiming conformance to Shift-JIS and ask it to interpret
that as JIS characters, it should tell me that it is
garbage. *How* it tells me that it is garbage is a matter of
API design, code design, and application design.

Unicode did not invent the notion of conformance to character
encoding standards. What is new about Unicode is that it has
*3* interoperable character encoding forms, not just one, and
all of them are unusual in some way, because they are designed
for a very, very large encoded character repertoire, and
involve multibyte and/or non-byte code unit representations.

 Well, I just hope no one will listen to them and modify strcpy and strchr to
 validate the data when running in UTF-8 locale and start signalling
 something (really, where and how?!). The two statements from UTC don't make
 sense when put together. Unless we are really expected to start building
 everything from scratch.

This is bogus. The UTC has never asked anyone to modify strcpy
and strchr. What anyone implementing UTF-8 using a C runtime
library (or similar set of functions) has to do is completely
comparable to what they have to do for supporting any other
multibyte character encoding on such systems. If your system
handles euc-kr, euc-tw, and/or euc-jp correctly, then adding
UTF-8 support is comparable, in principle and in practice.

--Ken




Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Arcane Jill [EMAIL PROTECTED] writes:

 If so, Marcin, what exactly is the error, and whose fault is it?

It's an error to use locales with different encodings on the same
system.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: Roundtripping in Unicode

2004-12-14 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Lars Kristan [EMAIL PROTECTED] writes:
Hm, here lies the catch. According to UTC, you need to keep
processing the UNIX filenames as BINARY data. And, also according
to UTC, any UTF-8 function is allowed to reject invalid sequences.
Basically, you are not supposed to use strcpy to process filenames.
No: strcpy passes raw bytes, it does not interpret them according to
the locale. It's not an UTF-8 function.
Correct: [wc]strcpy() handles string instances, but not all string 
instances are plain-text, so they don't need to obey UTF encoding rules 
(they just obey the convention of null-byte termination, with no 
restriction on the string length, which is measured as a size in 
[w]char[_t] and not as a number of Unicode characters).

This is true for the whole standard C/C++ string libraries, as well as in 
Java (String and Char objects or native char datatype), and as well in 
almost all string handling libraries of common programming languages.

A locale defined as UTF-8 will experience lots of problems because of 
the various ways applications will behave when faced with encoding errors 
encountered in filenames: exceptions thrown that abort the program, 
substitution by ? or U+FFFD causing the wrong files to be accessed, some 
files not processed because their name was considered invalid although 
they were effectively created by some user of another locale...

Filenames are identifiers coded as strings, not as plain-text (even if most 
of these filename strings are plain-text).

The solution is then to use a locale based on a relaxed version of UTF-8 
(some spoke about defining NOT-UTF-8 and NOT-UTF-16 encodings to allow 
any sequence of code units, but nobody has thought about how to make 
NOT-UTF-8 and NOT-UTF-16 mutually fully reversible; now add NOT-UTF-32 
to this nightmare and you will see that NOT-UTF-32 needs to encode 2^32 
distinct NOT-Unicode-codepoints, and that they must map bijectively to 
exactly all 2^32 sequences possible in NOT-UTF-16 and NOT-UTF-8; I have not 
found a solution to this problem, and I don't know if such solution even 
exists; if such solution exists, it should be quite complex...).




Re: Roundtripping in Unicode

2004-12-14 Thread Doug Ewell
 Unicode did not invent the notion of conformance to character
 encoding standards. What is new about Unicode is that it has
 *3* interoperable character encoding forms, not just one, and
 all of them are unusual in some way, because they are designed
 for a very, very large encoded character repertoire, and
 involve multibyte and/or non-byte code unit representations.

Geez, even when I was going through my stage of inventing wild and crazy
new UTF's, I made sure they were 100% convertible to and from code
points.  How could they not be?

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
Title: RE: Roundtripping in Unicode






 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED]] On Behalf Of Mike Ayers
 Sent: Tuesday, December 14, 2004 3:29 PM


 The rule is No zero, no eight. 


 No zero, no forty seven.


 My bad.




/|/|ike




"Tumbleweed E-mail Firewall tumbleweed.com" made the following
 annotations on 12/14/04 16:25:28
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
Title: RE: Roundtripping in Unicode






 From: Peter Kirk [mailto:[EMAIL PROTECTED]] 
 Sent: Tuesday, December 14, 2004 3:37 PM


 Thanks for the clarification. Perhaps the bifurcation could 
 be better expressed as into "strings of characters as defined 
 by the locale" and "strings of non-null octets". Then I could 
 re-express this as "the only safe way out of this mess is 
 never to process filenames as strings of characters as 
 defined by the locale".


 That would not be correct for ISO 8859 locales, though (amongst others). That's why I specified UTF-8. Although other locales may have the problem of invalid sequences, we're only interested in UTF-8 here.

 Well, I was assuming that when John Cowan implied that 0x08 
 was permitted, and Jill wrote "Unix filenames consist of an 
 arbitrary sequence of octets, excluding 0x00 and 0x2F", they 
 were speaking from the appropriate orifices.


 Correct, and my bad. I got thrown off by John's:


(A private correspondent has come up with an ingenious trick which 
depends on being able to create files named 0x08 and 0x7F, but it truly 
is a trick, and in any case depends only on an ASCII interpretation.)


 which I misinterpreted to mean that 0x08 was a forbidden character. It isn't - just real hard to type!



/|/|ike




"Tumbleweed E-mail Firewall tumbleweed.com" made the following
 annotations on 12/14/04 16:24:51
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


RE: Roundtripping in Unicode

2004-12-14 Thread Arcane Jill
I've been following this thread for a while, and I've pretty much got the 
hang of the issues here. To summarize:

Unix filenames consist of an arbitrary sequence of octets, excluding 0x00 
and 0x2F. How they are /displayed/ to any given user depends on that user's 
locale setting. In this scenario, two users with different locale settings 
will see different filenames for the same file, but they will still be able 
to access the file via the filename that they see. These two filenames will 
be spelt identically in terms of octets, but (apparently) differently when 
viewed in terms of characters.

At least, that's how it was until the UTF-8 locale came along. If we 
consider only one-byte-per-character encodings, then any octet sequence is 
valid in any locale. But UTF-8 introduces the possibility that an octet 
sequence might be invalid - a new concept for Unix. So if you change your 
locale to UTF-8, then suddenly, some files created by other users might 
appear to you to have invalid filenames (though they would still appear 
valid when viewed by the file's creator).

A specific example: if a file F is accessed by two different users, A and B, 
of whom A has set their locale to Latin-1, and B has set their locale to 
UTF-8, then the filename may appear to be valid to user A, but invalid to 
user B.

Lars is saying (and he's probably right, because he knows more about Unix 
than I) that user B does not necessarily have the right to change the actual 
octet sequence which is the filename of F, just to make it appear valid to 
user B, because doing so would stop a lot of things working for user A (for 
instance, A might have created the file, the filename might be hardcoded in 
a script, etc.). So Lars takes a Unix-like approach, saying retain the 
actual octet sequence, but feel free to try to display and manipulate it as 
if it were some UTF-8-like encoding in which all octet sequences are valid. 
And all this seems to work fine for him, until he tries to roundtrip to 
UTF-16 and back.

I'm not sure why anyone's arguing about this though - Philippe's suggestion 
seems to be the perfect solution which keeps everyone happy. So...

...allow me to construct a specific example of what Philippe suggested only 
generally:

DEFINITION - NOT-Unicode is the character repertoire consisting of the 
whole of Unicode, and 128 additional characters representing integers in the 
range 0x80 to 0xFF.

OBSERVATION - Unicode is a subset of NOT-Unicode
DEFINITION - NOT-UTF-8 is a bidirectional encoding between a NOT-Unicode 
character stream and an octet stream, defined as follows: if a NOT-Unicode 
character is a Unicode character then its encoding is the UTF-8 encoding of 
that character; else the NOT-Unicode character must represent an integer, in 
which case its encoding is itself. To decode, assume the next NOT-Unicode 
character is a Unicode character and attempt to decode from the octet stream 
using UTF-8; if this fails then the NOT-Unicode character is an integer, in 
which case read one single octet from the stream and return it.

OBSERVATION - All possible octet sequences are valid NOT-UTF-8.
OBSERVATION - NOT-Unicode characters which are Unicode characters will be 
encoded identically in UTF-8 and NOT-UTF-8

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot 
be represented in UTF-8

DEFINITION - NOT-UTF-16 is a bidirectional encoding between a NOT-Unicode 
character stream and a 16-bit word stream, defined as follows: if a 
NOT-Unicode character is a Unicode character then its encoding is the UTF-16 
encoding of that character; else the NOT-Unicode character must represent an 
integer, in which case its encoding is 0xDC00 plus the integer. To decode, 
if the next 16-bit word is in the range 0xDC80 to 0xDCFF then the 
NOT-Unicode character is the integer whose value is (word16 - 0xDC00), else 
the NOT-Unicode character is the Unicode character obtained by decoding as 
if UTF-16.

OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 -> 
NOT-UTF-16 -> NOT-UTF-8

OBSERVATION - NOT-Unicode characters which are Unicode characters will be 
encoded identically in UTF-16 and NOT-UTF-16

OBSERVATION - NOT-Unicode characters which are not Unicode characters cannot 
be represented in UTF-16

DEFINITION - NOT-UTF-32 is a bidirectional encoding between a NOT-Unicode 
character stream and a 32-bit word stream, defined as follows: if a 
NOT-Unicode character is a Unicode character then its encoding is the UTF-32 
encoding of that character; else the NOT-Unicode character must represent an 
integer, in which case its encoding is 0xDC00 plus the integer. To 
decode, if the next 32-bit word is in the range 0xDC80 to 0xDCFF 
then the NOT-Unicode character is the integer whose value is (word32 - 
0xDC00), else the NOT-Unicode character is the Unicode character 
obtained by decoding as if UTF-32.

OBSERVATION - Roundtripping is possible in the directions NOT-UTF-8 -> 
NOT-UTF-32 -> NOT-UTF-8
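For concreteness, here is one possible C rendering of the NOT-UTF-8 and NOT-UTF-16 definitions above (a sketch only, not part of the original mail; buffer management is ignored). Octets that do not begin a well-formed UTF-8 sequence come out as the 16-bit units 0xDC80..0xDCFF and turn back into the identical octets on the way out, which is the NOT-UTF-8 -> NOT-UTF-16 -> NOT-UTF-8 roundtrip observed above.

#include <stdint.h>
#include <stddef.h>

/* Decode one NOT-UTF-8 unit starting at p (end marks the end of the input).
 * Stores either a Unicode scalar value (for a valid UTF-8 sequence) or
 * 0xDC00 + b for a byte b that does not start one, and returns the number
 * of bytes consumed (always at least 1). */
static size_t not_utf8_get(const unsigned char *p, const unsigned char *end,
                           uint32_t *out)
{
    unsigned char b = p[0];
    uint32_t cp;
    size_t len, i;

    if (b < 0x80) { *out = b; return 1; }
    else if (b >= 0xC2 && b <= 0xDF) { cp = b & 0x1F; len = 2; }
    else if (b >= 0xE0 && b <= 0xEF) { cp = b & 0x0F; len = 3; }
    else if (b >= 0xF0 && b <= 0xF4) { cp = b & 0x07; len = 4; }
    else goto raw;              /* 0x80..0xC1 and 0xF5..0xFF never start a sequence */

    if ((size_t)(end - p) < len) goto raw;
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80) goto raw;
        cp = (cp << 6) | (p[i] & 0x3F);
    }
    if ((len == 3 && cp < 0x800) || (len == 4 && cp < 0x10000) ||
        (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
        goto raw;               /* overlong forms, surrogates, beyond U+10FFFF */
    *out = cp;
    return len;
raw:
    *out = 0xDC00u + b;         /* the "integer" NOT-Unicode character for one raw byte */
    return 1;
}

/* NOT-UTF-8 octets -> NOT-UTF-16 code units; returns the number of units
 * written (dst needs room for at most n units). */
size_t not_utf8_to_not_utf16(const unsigned char *src, size_t n, uint16_t *dst)
{
    size_t i = 0, w = 0;
    uint32_t v;
    while (i < n) {
        i += not_utf8_get(src + i, src + n, &v);
        if (v < 0x10000) {
            dst[w++] = (uint16_t)v;            /* BMP character, or 0xDC00 + raw byte */
        } else {                               /* supplementary plane: surrogate pair */
            v -= 0x10000;
            dst[w++] = (uint16_t)(0xD800 + (v >> 10));
            dst[w++] = (uint16_t)(0xDC00 + (v & 0x3FF));
        }
    }
    return w;
}

/* NOT-UTF-16 code units -> NOT-UTF-8 octets; returns the number of bytes
 * written (dst needs room for at most 3 * n bytes). */
size_t not_utf16_to_not_utf8(const uint16_t *src, size_t n, unsigned char *dst)
{
    size_t i = 0, b = 0;
    while (i < n) {
        uint32_t v = src[i++];
        if (v >= 0xDC80 && v <= 0xDCFF) {      /* an escaped raw byte: emit it unchanged */
            dst[b++] = (unsigned char)(v - 0xDC00);
            continue;
        }
        if (v >= 0xD800 && v <= 0xDBFF && i < n &&
            src[i] >= 0xDC00 && src[i] <= 0xDFFF)  /* surrogate pair -> scalar value */
            v = 0x10000 + ((v - 0xD800) << 10) + (src[i++] - 0xDC00);
        if (v < 0x80) {                        /* ordinary UTF-8 encoding from here on */
            dst[b++] = (unsigned char)v;
        } else if (v < 0x800) {
            dst[b++] = (unsigned char)(0xC0 | (v >> 6));
            dst[b++] = (unsigned char)(0x80 | (v & 0x3F));
        } else if (v < 0x10000) {
            dst[b++] = (unsigned char)(0xE0 | (v >> 12));
            dst[b++] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
            dst[b++] = (unsigned char)(0x80 | (v & 0x3F));
        } else {
            dst[b++] = (unsigned char)(0xF0 | (v >> 18));
            dst[b++] = (unsigned char)(0x80 | ((v >> 12) & 0x3F));
            dst[b++] = (unsigned char)(0x80 | ((v >> 6) & 0x3F));
            dst[b++] = (unsigned char)(0x80 | (v & 0x3F));
        }
    }
    return b;
}

The roundtrip only works because a well-formed UTF-8 sequence can never decode to a surrogate value, so the 0xDC80..0xDCFF units are never confused with the encoding of a real character.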

UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)

2004-12-14 Thread Edward H. Trager
On Tuesday 2004.12.14 12:50:43 -, Arcane Jill wrote:
 If I have understood this correctly, filenames are not in a locale, they 
 are absolute. Users, on the other hand, are in a locale, and users view 
 filenames. The same filename can look different to two different users. 
 To user A (whose locale is Latin-1), a filename might look valid; to user B 
 (whose locale is UTF-8), the same filename might look invalid.

Correct. The problem will however be limited to the accented
Latin characters present in ISO-8859-1 beyond the ASCII set.  The basic Latin
alphabet in the ASCII set
at the beginning of both ISO-8859-1 and UTF-8 will appear unchanged to both 
users (UTF-8 user looking at Latin-1's home directory, or Latin-1 looking at
UTF-8's home directory).  So both users could probably guess the filename
they were looking at.  For example, here is a file on my local machine,
a Linux box with the locale set to LANG=en_US.UTF-8:

  déclaration_des_droits.utf8

The accented e in déclaration appears correctly under the UTF-8 locale.

I then copied this file (using scp) over to an older Sun Solaris box which I do 
not administer, so I have to live with the C POSIX locale that they have got 
that machine set to.  Now, when I view the file names in a terminal (where the 
terminal emulator is set to the same locale), I see:

  d??claration_des_droits.utf8

The terminal, being set to interpret the legacy locale, does not know 
how to interpret the two bytes that are used for the UTF-8 é.
Still, I can guess that the first word should be déclaration.

The solution, as has been pointed out, is for everyone to move to
UTF-8 locales.  In the Linux and Unix world, this is already happening
for the most part.  Solaris 10 now defaults to a UTF-8 locale, at least
when set to English.  Both SuSE and Redhat default to UTF-8 locales
for most language and script environments.  And (open source) tools exist for
converting file names from one encoding to another encoding on Linux
and Unix systems.  A group of Japanese developers is working on an NLS 
implementation for the BSDs like OpenBSD, which are currently stuck with 
nothing but the C POSIX locale.  I think the name of that project is Citrus.

-- Ed Trager

   

 
 Is that right, Lars?
 
 If so, Marcin, what exactly is the error, and whose fault is it?
 
 Jill
 
 -Original Message-
 
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
 
 Behalf Of Marcin 'Qrczak' Kowalczyk
 
 Sent: 13 December 2004 14:59
 
 To: [EMAIL PROTECTED]
 
 Subject: Re: Roundtripping in Unicode
 
 Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
 
 
 
 
 
 


Re: Roundtripping in Unicode

2004-12-14 Thread Peter Kirk
On 14/12/2004 17:47, John Cowan wrote:
Peter Kirk scripsit:
 

I think the problem here is that a Unix filename is a string of octets, 
not of characters. And so it should not be converted into another 
encoding form as if it is characters; it should be processed at a quite 
different level of interpretation.
   

Unfortunately, that is simply a counsel of perfection.
Unix filenames are in general input as character strings, output as character
strings, and intended to be perceived as character strings.  The corner
cases in which this does not work are not sufficient to overthrow the
power and generality to be achieved by assuming it 99% of the time.
 

This is a design flaw in Unix, or in how it is explained to users. Well, 
Lars wrote "Basically, you are not supposed to use strcpy to process 
filenames." I'm not sure if that is his opinion or someone else's, but 
the only safe way out of this mess is never to process filenames as strings.

(A private correspondent has come up with an ingenious trick which
depends on being able to create files named 0x08 and 0x7F, but it
truly is a trick, and in any case depends only on an ASCII interpretation.)
 

This may be called a trick but it looks like it could very easily be a 
security hole. For example, a filename 0x41 0x08 0x42 will be displayed 
the same as just 0x42, in a Latin-1 or UTF-8 locale. Your friend's trick 
has become an open door for spoofers.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/



RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
Title: RE: Roundtripping in Unicode






 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED]] On Behalf Of Peter Kirk
 Sent: Tuesday, December 14, 2004 11:32 AM


 This is a design flaw in Unix, or in how it is explained to 
 users. Well, Lars wrote "Basically, you are not supposed to 
 use strcpy to process filenames." I'm not sure if that is his 
 opinion or someone else's, but the only safe way out of this 
 mess is never to process filenames as strings.


 As mentioned by Kenneth, Lars was speaking from the wrong orifice when he said that.


 Also, it appears that the term string is being used too much and without qualification. The entire focus of this thread is on what happens when unqualified bytes (filenames) get qualified (by locale), so it would behoove us all to qualify all the strings we're talking about. For instance, Peter's last clause above bifurcates into:

 ...but the only safe way out of this mess is never to process filenames as UTF-8 strings.


 and:


 ...but the only safe way out of this mess is always to process filenames as opaque C strings.


 which was mentioned early on in this thread, but Lars does not wish to do this.


 This may be called a trick but it looks like it could very 
 easily be a security hole. For example, a filename 0x41 0x08 
 0x42 will be displayed the same as just 0x42, in a Latin-1 or 
 UTF-8 locale. Your friend's trick has become an open door for 
 spoofers.


 Exactly why 0x08 was banned in filenames, as I recall.



/|/|ike




"Tumbleweed E-mail Firewall tumbleweed.com" made the following
 annotations on 12/14/04 13:16:29
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


Re: Roundtripping in Unicode

2004-12-14 Thread Philippe Verdy
From: Marcin 'Qrczak' Kowalczyk [EMAIL PROTECTED]
Arcane Jill [EMAIL PROTECTED] writes:
If so, Marcin, what exactly is the error, and whose fault is it?
It's an error to use locales with different encodings on the same
system.
More simply, I think that it's an error to have the encoding part of any 
locale... The system should not depend on them, and for critical things like 
filesystem volumes, the encoding should be forced by the filesystem itself, 
and applications should mandatorily follow the filesystem rules.

Now think about the web itself: it's really a filesystem, with billions of 
users, or trillions of applications using simultaneously hundreds or thousands 
of incompatible encodings... Many resources on the web seem to have valid 
URLs for some users but not for others, until URLs are made independent of 
any user locale, and then not considered as encoded plain-text but only as 
strings of bytes.




RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
Title: RE: Roundtripping in Unicode






 From: [EMAIL PROTECTED] 
 [mailto:[EMAIL PROTECTED]] On Behalf Of Philippe Verdy
 Sent: Tuesday, December 14, 2004 2:47 PM


 More simply, I think that it's an error to have the encoding 
 part of any locale... The system should not depend on them, 
 and for critical things like filesystem volumes, the encoding 
 should be forced by the filesystem itself, and applications 
 should mandatorily follow the filesystem rules.


 It doesn't, it is, and they do.


 The rule is No zero, no eight.


 The problem is that these valid filenames can't all be translated as valid UTF-8 Unicode.


 Now think about the web itself: it's really a filesystem, 


 No. It isn't.


 with billions of users, or trillions of applications using 
 simultaneously hundreds or thousands of incompatible 
 encodings... Many resources on the web seem to have valid 
 URLs for some users but not for others, until URLs are made 
 independent of any user locale, and then not considered as 
 encoded plain-text but only as strings of bytes.


 I thought that URLs were specified to be in Unicode. Am I mistaken?



/|/|ike



P.S. [OT] Note the below autoattachment. I recall that we discussed such clauses on the list some time ago with regard to their legal standing. Does anyone have a pointer to substantive material on the subject? I've gotten curious again, 'natch.



"Tumbleweed E-mail Firewall tumbleweed.com" made the following
 annotations on 12/14/04 15:31:51
--
This e-mail, including attachments, may include confidential and/or proprietary information, and may be used only by the person or entity to which it is addressed.  If the reader of this e-mail is not the intended recipient or his or her authorized agent, the reader is hereby notified that any dissemination, distribution or copying of this e-mail is prohibited. If you have received this e-mail in error, please notify the sender by replying to this message and delete this e-mail immediately.
==


Re: Roundtripping in Unicode

2004-12-14 Thread John Cowan
Mike Ayers scripsit:

   I thought that URLs were specified to be in Unicode.  Am I mistaken?

You are.  URLs are specified to be in *ASCII*.  There is a %-encoding
hack that allows you to represent random-octet filenames as ASCII.
Some people (including me) think it's a good idea to use this hack
to specify non-ASCII characters with double encoding (first as UTF-8,
then with the %-hack), but the URI Syntax RFC doesn't say.
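A tiny sketch of that double encoding (illustrative only; it escapes every octet outside a conservative unreserved ASCII set, so UTF-8 bytes and arbitrary filename bytes alike come out as plain ASCII):

#include <stdio.h>

/* Percent-encode arbitrary octets (for example a UTF-8 or raw filename) as
 * pure ASCII.  Everything outside a conservative unreserved set becomes %XX. */
static void pct_encode(const unsigned char *src, size_t n, FILE *out)
{
    static const char hex[] = "0123456789ABCDEF";
    for (size_t i = 0; i < n; i++) {
        unsigned char c = src[i];
        int unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                         (c >= '0' && c <= '9') ||
                         c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved)
            fputc(c, out);
        else
            fprintf(out, "%%%c%c", hex[c >> 4], hex[c & 0x0F]);
    }
}

int main(void)
{
    /* The UTF-8 bytes of "déclaration" followed by a stray ill-formed 0xA9. */
    const unsigned char name[] = "d\xC3\xA9" "claration_" "\xA9.txt";
    pct_encode(name, sizeof name - 1, stdout);   /* d%C3%A9claration_%A9.txt */
    putchar('\n');
    return 0;
}

Whether the receiver then treats the unescaped octets as UTF-8, Latin-1 or raw bytes is exactly the agreement the RFC leaves open.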

-- 
John Cowan  [EMAIL PROTECTED]
http://www.reutershealth.comhttp://www.ccil.org/~cowan
Humpty Dump Dublin squeaks through his norse
Humpty Dump Dublin hath a horrible vorse
But for all his kinks English / And his irismanx brogues
Humpty Dump Dublin's grandada of all rogues.  --Cousin James


RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 You are trying to stick with processing byte sequences, carefully
 preserving the storage format instead of preserving the meaning in
 terms of Unicode characters. This leads to less robust software
 which is not certain about the encoding of texts it processes and
 thus can't apply algorithms like case mapping without risking doing
 a meaningless damage to the text.
I am not proposing that this approach is better or that it should be used generally. What I am saying is that this approach is, unfortunately, needed in order to make the transition easier. The fact is that currently data exists that cannot be converted easily. Over-robust software, in my opinion, can be impractical and might not be accepted with open arms. We should acknowledge the fact that some products will choose a different path. You can say these applications will be less robust, but we should really give the users a choice and let them decide what they want.

 Conversion should signal an error by default. Replacing errors by
 U+FFFD should be done only when the data is processed purely for
 showing it to the user, without any further processing, i.e. when it's
 better to show the text partially even if we know that it's corrupted.
I think showing it to the user is not the only case when you need to use U+FFFD. A text viewer could do the replacement when reading the file and do further processing in Unicode. But an editor cannot. Keeping the text in original binary form is far from practical and opens numerous possibilities for bugs. But, as I once already said, you can do it with UTF-8, you simply keep the invalid sequences as they are, and really handle them differently only when you actually process them or display them. But you cannot do this in UTF-16, since you cannot preserve all the data.

As for signalling - in some cases signalling is impossible. Listing files in a directory should not signal anything. It MUST return all files and it should also return them in a way that this list can be used to access each of the files.
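A small sketch of the kind of listing meant here (illustrative only, plain POSIX calls, error handling mostly omitted): the names that readdir() hands back are kept as opaque bytes and passed straight back to openat(), so every file is reachable and nothing has to be signalled, whatever the current locale thinks of the byte sequences.

#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

/* Walk a directory the way a backup tool or virus scanner has to:
 * the names coming back from readdir() are opaque bytes, and those same
 * bytes are handed straight back to openat(), so every file is reachable
 * no matter what the current locale thinks of the byte sequence. */
int scan_directory(const char *path)
{
    DIR *d = opendir(path);
    if (d == NULL)
        return -1;

    struct dirent *e;
    while ((e = readdir(d)) != NULL) {
        if (e->d_name[0] == '.' &&
            (e->d_name[1] == '\0' ||
             (e->d_name[1] == '.' && e->d_name[2] == '\0')))
            continue;                          /* skip "." and ".." */

        int fd = openat(dirfd(d), e->d_name, O_RDONLY);
        if (fd >= 0) {
            /* ... read, back up or scan the file here ... */
            close(fd);
        }
        /* Only for display would the name be pushed through a locale
         * (or roundtrip-preserving) conversion; access never depends on it. */
    }
    closedir(d);
    return 0;
}

Display is the only step where a conversion to characters has to enter the picture at all.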

 
  Either you do everything in UTF-8, or everything in UTF-16. Not
  always, but typically. If comparisons are not always done in the
  same UTF, then you need to validate. And not validate while
  converting, but validate on its own. And now many designers will
  remember that they didn't. So, all UTF-8 programs (of that kind)
  will need to be fixed. Well, might as well adopt my broken
  conversion and fix all UTF-16 programs. Again, of that kind, not all
  in general, so there are few. And even those would not be all
  affected. It would depend on which conversion is used where. Things
  could be worked out. Even if we would start changing all the
  conversions. Even more so if a new conversion is added and only used
  when specifically requested.
 
 I don't understand anything of this.
Let's start with UTF-8 usernames. This is a likely scenario, since I think UTF-8 will typically be used in network communication. If you store the usernames in UTF-16, the conversion will signal an error and you will not have any users with invalid UTF-8 sequences nor will any invalid sequence be able to match any user. If you later on start comparing users somewhere else, in UTF-8, then you must not only strcmp them, but also validate each string. This is just a fact and I am not complaining about it.

In the opposite case, if you would have UTF-8 storage and UTF-16 communication, and any comparisons would be done in UTF-16, you again need to validate the UTF-16 strings.

Now I am supposing that there are such applications already out there. And that some of them do not validate (or validate only in conversion, but not when comparing or otherwise processing native strings).

They should be analyzed and fixed. At the time I wrote the above paragraph, I thought UTF-16 programs don't need to validate, but that is not true, so all the applications need to be fixed, if they are not already validating.

Now, suppose my 'broken' conversion is standardized. As an option, not for UTF-16 to UTF-8 conversion. If you don't start using it, the existing rules apply.

The interesting thing is that if you do start using my conversion, you can actually get rid of the need to validate UTF-8 strings in the first scenario. That of course means you will allow users with invalid UTF-8 sequences, but if one determines that this is acceptable (or even desired), then it makes things easier. But the choice is yours.

For the second scenario, things do indeed become a bit more complicated. But can be solved. And there is still a number of choices you can make about the level of validation. And, again, one of them is that you keep using the existing conversion and the existing validation.

 
  I cannot afford not to access the files.
 
 Then you have two choices:
 - Don't use Unicode.
As soon as a Windows system enters the picture, it is practically impossible

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 But, as I once already said, you can do it with UTF-8, you simply
 keep the invalid sequences as they are, and really handle them
 differently only when you actually process them or display them.

UTF-8 is painful to process in the first place. You are making it
even harder by demanding that all functions which process UTF-8 do
something sensible for bytes which don't form valid UTF-8. They even
can't temporarily convert it to UTF-32 for internal processing for
convenience.

 Listing files in a directory should not signal anything. It MUST
 return all files and it should also return them in a way that this
 list can be used to access each of the files.

Which implies that they can't be interpreted as UTF-8.

By masking an error you are not encouraging users to fix it.
Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.

 Let's start with UTF-8 usernames. This is a likely scenario, since I
 think UTF-8 will typically be used in network communication. If you
 store the usernames in UTF-16, the conversion will signal an error
 and you will not have any users with invalid UTF-8 sequences nor
 will any invalid sequence be able to match any user. If you later on
 start comparing users somewhere else, in UTF-8, then you must not
 only strcmp them, but also validate each string. This is just a fact
 and I am not complaining about it.

If usernames are supposed to be UTF-8, and in fact they are not,
then it's normal that some software will signal an error instead
of processing them. The proper way is to fix the username database,
not to change programs.

 The interesting thing is that if you do start using my conversion,
 you can actually get rid of the need to validate UTF-8 strings
 in the first scenario. That of course means you will allow users
 with invalid UTF-8 sequences, but if one determines that this is
 acceptable (or even desired), then it makes things easier. But the
 choice is yours.

For me it's not acceptable, so I will not support declaring it valid.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Philippe Verdy wrote:
 An implementation that uses UTF-8 for valid string could use 
 the invalid 
 ranges for lead bytes to encapsulate invalid byte values. 
 Note however that 
 invalid bytes you would need to represent have 256 possible 
 values, but the 
 UTF-8 lead bytes have only 2 reserved values (0xC0 and 0xC1) 
 each for 64 
 codes, if you want to use an encoding on two bytes. The 
 alternative would be 
 to use the UTF-8 lead byte values which have initially been 
 assigned to byte 
 sequences longer than 4 bytes, and that are now unassigned/invalid in 
 standard UTF-8. For example: {0xF8+(n/64); 0x80+(n%64)}.
 Here also it will be a private encoding, that should NOT be 
 named UTF-8, and 
 the application should clearly document that it will not only 
 accept any 
 valid Unicode string, but also some invalid data which will have some 
 roundtrip compatibility.
Now you are devising an algorithm to store invalid sequences with other invalid sequences. In UTF-8. Why not simply stick with the original invalid sequences?

And the whole purpose of what I am trying to do is to get VALID sequences. In order to be able to store and manipulate with Unicode strings.

 
 So what is the problem: suppose that the application, 
 internally, starts to 
 generate strings containing any occurences of such private 
 sequences, then 
 it will be possible for the application to generate on its 
 output a byte 
 stream that would NOT have roundtrip compatibility, back to 
 the private 
 representation. So roundtripping would only be guaranteed for streams 
 converted FROM an UTF-8 where some invalid sequences are 
 present and must be 
 preserved by the internal representation. So the 
 transformation is not 
 bijective as you would think, and this potentially creates 
 lots of possible 
 security issues.
Yes, it does. An application that uses my approach needs to be designed accordingly. *IF* the security issues apply. For a UTF-16 text editor this probably doesn't apply (in terms of data, not filenames). And this is just an example; with a text editor you can perhaps force the user to select a different encoding, but there are cases where this cannot be done and the data still needs to be preserved.

So far, many people have suggested that there is no need to preserve 'invalid data'. After some argumentation and a couple of examples, the need is acknowledged. But then they question the way it is done. They see the codepoint approach as unsuitable or unneeded. And suggest using some form of escaping. Now, any escaping has exactly the same problems you are mentioning, and some on top. And is actually representing invalid data with valid codepoints (except more than one per invalid byte), which you say is a definite no-no.

And on top of all, the approach I am proposing is NOT intended to be used everywhere. It should only be used when interfacing to a system that cannot guarantee valid UTF-8, but does use UTF-8. For example, a UNIX filesystem. And, actually, if the security is entirely done by the filesystem, then it doesn't even matter if two UTF-16 strings map to the same filename. They will open the same file. Or be both denied. Which is exactly what is required. A Windows filesystem is case preserving but case insensitive. Did it ever bother you that you can use either upper case or lower case filename to open a file? Does it introduce security issues? Typically no, because you leave the security to the filesystem. And those checks are always done in the same UTF.

This is a simple example of something that doesn't even need to be fixed. There are cases where validation would really need to be fixed. But then again, only if you use the new conversion. If you don't, your security remains exactly where it is today.

We should be analyzing the security aspects. Learning where it can break, and in which cases. Get to know the enemy. And once we understand that things are manageable and not as frightening as it seems at first, then we can stop using this as an argument against introducing 128 codepoints. People who will find them useful should and will bother with the consequences. Others don't need to and can roundtrip them as today.

So, interpreting the 128 codepoints as 'recreate the original byte sequence' is an option. If you convert from UTF-16 to UTF-8, then you do exactly as you do now. Even I will do the same where I just want to represent Unicode in UTF-8. I will only use this conversion in certain places. The fact that my conversion actually produces UTF-8 from most of Unicode points does not mean it produced UTF-8. The result is just a byte sequence. The same one that I started with when I was replacing invalid sequences with the 128 codepoints. And this is not limited to conversion from 'byte sequence that is mostly UTF-8' to UTF-16. I can (and even should) convert from this byte sequence to UTF-8. Preserving most of it and replacing each byte of invalid sequences

RE: RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: RE: Roundtripping in Unicode






Philippe VERDY wrote:
 If a source sequence is invalid, and you want to preserve it, 
 then this sequence must remain invalid if you change its encoding.
 So there's no need for Unicode to assign valid code points 
 for invalid source data.
Using invalid UTF-16 sequences to represent invalid UTF-8 sequences is a known approach (UTF-8B, if I remember correctly). But this is then not UTF-16 data so you don't gain much. The data is at risk of being rejected or filtered out at any time. And that misses the whole point.

Specifically, unpaired surrogates that are used in the UTF-8B conversion have additional risks, but that is not the issue now.


 Using PUA space or some unassigned space in Unicode to 
 represent invalid sequences present in a source text will be 
 a severe design error in all cases, because that conversion 
 will not be bijective and could map invalid sequences to 
 valid ones without further notice, changing the status of the 
 original text which should be kept as incorrectly encoded, 
 until explicitly corrected or until the source text is 
 reparsed with another more appropriate encoding.
Again, I am not changing the UTF-8 definition. In places where I do decide to interpret the 128 codepoints differently, it is my responsibility to understand the risks. If there is a risk, I can prevent it. If there is no risk, then I don't need to do anything. Thanks for the warning, but may I be allowed to decide whether it applies to me or not? Or will you insist that such codepoints should not be assigned to protect the innocent? Let's stop producing knives. They're dangerous.

 (In fact I also think that mapping invalid sequences to 
 U+FFFD is also an error, because U+FFFD is valid, and the 
 presence of the encoding error in the source is lost, and 
 will not throw exceptions in further processings of the 
 remapped text, unless the application constantly checks for 
 the presence of U+FFFD in the text stream, and all modules in 
 the application explicitly forbids U+FFFD within its interface...)
Generally, no, most definitely not. Your concern is ONLY valid in security related processing. In data processing, you must preserve the data. U+FFFD is a valid codepoint. A certain application may treat it as special, just as another might treat '/' as special. But you are almost suggesting that U+FFFD is invalid and should be signalled all over. When you realize that U+FFFD is just a codepoint, then you will also understand that codepoints for invalid sequences must also be codepoints. Valid codepoints.

I think my ideas are often misunderstood because I speak mainly of using these codepoints for preserving the invalid sequences. Leading to conclusion that I want to corrupt UTF-8. But that is not so. For one, this mechanism is not intended to replace neither decoding UTF-8, nor encoding UTF-8. It is to be used on interfaces that cannot guarantee pure UTF-8 data. And UTF-8 is just an example, one can use the replacement codepoints for preserving bytes in other encodings, for example a 0xA5 in Latin 3.


Lars





Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 And once we understand that things are manageable and not as
 frightening as it seems at first, then we can stop using this as an
 argument against introducing 128 codepoints. People who will find
 them useful should and will bother with the consequences. Others
 don't need to and can roundtrip them as today.

A person who is against them can't ignore a motion to introduce them,
because if they are introduced, other people / programs will start
feeding our programs arbitrary byte sequences labeled as UTF-8
expecting them to accept the data.

 So, interpreting the 128 codepoints as 'recreate the original byte
 sequence' is an option.

Which guarantees that different programs will have different view of
the validity and meaning of the same data labeled by the same encoding.
Long live standardization.

 Even I will do the same where I just want to represent Unicode in
 UTF-8. I will only use this conversion in certain places.

So it's not just different programs, but even the same program in
different places. Great...

 The fact that my conversion actually produces UTF-8 from most of
 Unicode points does not mean it produced UTF-8.

Increasing the number of encodings means more opportunities of
mislabeling and using wrong libraries to process data (as it works
in most of cases and thus the error is not detected immediately)
and harder life for programs which aim at supporting all data.

Think further than the immediate moment where many people are
performing a transition from something to UTF-8. Look what happened
with the interpretation of HTML in web browsers.

If the standard from the beginning stood firmly at disallowing
guessing what a malformed HTML was supposed to mean, then people
would learn how to produce correct HTML and the interpretation would
be unambiguous. But browsers tried to accept arbitrary contents and
interpret parts of HTML they found there, guessing how errors should
be resolved, being friendly to careless webmasters. The effect is
that too often they submitted a webpage after checking that it works
in their browser, but in fact it had basic syntax errors. Other
browsers interpreted the errors differently, and the page was
inaccessible or looked bad.

When designing XML, they learned from this mistake:
http://www.xml.com/axml/target.html#dt-fatal
http://www.xml.com/axml/notes/Draconian.html

That's why people here reject balkanization of UTF-8 by introducing
variations with subtle differences, like Java-modified UTF-8.

 Inaccessible filenames are something we shouldn't accept. All your
 discussion of non-empty empty directories is just approaching the problem
 from the wrong end. One should fix the root cause, not consequences.

The root cause is that users and programs use different encodings in
different places, and thus Unix filenames can't be unambiguously and
context-freely interpreted as character sequences.

Unfortunately it's hard to fix.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



Re: RE: Roundtripping in Unicode

2004-12-13 Thread Philippe VERDY
Lars Kristan wrote: What I was talking about in the paragraph in question is what happens if you want to take unassigned codepoints and give them a new status.
You don't need to do that. No Unicode application must assign semantics to unassigned codepoints.
If a source sequence is invalid, and you want to preserve it, then this sequence must remain invalid if you change its encoding.
So there's no need for Unicode to assign valid code points for invalid source data.
There's enough space *assigned* as invalid (or assigned to non-characters) in all UTF forms to allow an application to create a local conversion scheme which will perform a bijective conversion of invalid sequences:
- for example in UTF-8: trailing bytes 0x80 to 0xBF isolated or in excess, or even the invalid lead bytes 0xF8 to 0xFF
- for example in UTF-16: 0xFFFE, 0xFFFF
- for example in UTF-32: same as UTF-16, plus all code units above 0x10FFFF
Using PUA space or some unassigned space in Unicode to represent invalid sequences present in a source text will be a severe design error in all cases, because that conversion will not be bijective and could map invalid sequences to valid ones without further notice, changing the status of the original text, which should be kept as incorrectly encoded until explicitly corrected or until the source text is reparsed with another more appropriate encoding.
(In fact I also think that mapping invalid sequences to U+FFFD is also an error, because U+FFFD is valid, and the presence of the encoding error in the source is lost, and will not throw exceptions in further processings of the remapped text, unless the application constantly checks for the presence of U+FFFD in the text stream, and all modules in the application explicitly forbid U+FFFD within its interface...)


Re: Roundtripping in Unicode

2004-12-13 Thread Mark Davis
Ken is absolutely right. It would be theoretically possible to add 128 code
points that would allow one to roundtrip a bytestream after passing through
a UTF-8 <=> UTF-32 conversion. (For that matter, it would be possible to add
2048 code points that would allow the same for a 16-bit data stream.)

However, these new code points would really be no better than private use
code points, since their interpretation would depend entirely on whatever
was assumed to be the interpretation of the original bytestream. If X
converted a bytestream that was assumed to be a mixture of 8858-7 with UTF-8
into Unicode with these new characters, and handed it off to Y, who
converted the bytestream back assuming that the odd bytes were to be
iso-8859-9, you would get data corruption. X and Y would have to agree on
the interpretation of these odd bytes to avoid that corruption, so it is
really no different than private use (where they also have to agree on the
interpretation).

Mark

- Original Message - 
From: Kenneth Whistler [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Monday, December 13, 2004 13:04
Subject: RE: Roundtripping in Unicode


 Lars Kristan stated:

  I said, the choice is yours. My proposal does not prevent you from doing
it
  your way. You don't need to change anything and it will still work the
way
  it worked before. OK? I just want 128 codepoints so I can make my own
  choice.

 You have them: U+EE80..U+EEFF, which are yours to use (or abuse)
 in an application as you see fit. Just don't expect others outside
 your application to interpret them as you do.

  And once and for all, you can treat those 128 codepoints just as you
  do today.

 A number of people on the list have patiently explained why what
 you are proposing to do fundamentally breaks UTF-8 and its
 relationship to other Unicode encoding forms.

 The chances that you will get the standard extended to incorporate
 these 128 code points and define their mapping to invalid byte
 values in UTF-8 is somewhere between zilch, nada, and nil.

 --Ken







RE: RE: RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: RE: RE: Roundtripping in Unicode





Philippe VERDY wrote:
 I don't think I miss the point. My suggested approach to 
 perform roundtrip conversions between UTF's while keeping all 
 invalid sequences as invalid (for the standard UTFs), is much 
 less risky than converting them to valid codepoints (and by 
 consequence to valid code units, because all valid code 
 points need valid code units in UTF encoding forms).


I still do think you are missing the point. About two years ago I started a similar thread. At that time I was pursuing the use of UTF-8B conversion, which uses one invalid sequence to represent another. It uses unpaired low surrogates. It works rather well, but one of the readers alerted me that I cannot expect that a Unicode database will be able (or, rather, willing) to process such data. Since I am not in a habit of writing every piece of the code myself (or by my team for that matter), I chose to use a third party database. The data that I have is mainly UTF-8, and users expect it to be interpreted as such. But are not expecting purism in the form of rejecting data (filenames) which contain invalid sequences. I am thankful to the person that pointed this out, and I have moved to using PUA. The rest of the responses were much like what I am getting now. Useless. Telling me to reject invalid sequences, telling me to rewrite everything and treat the data as binary. Or use an escaping technique, forgetting that everything they find wrong about the codepoint approach is also true for escaping. Except that escaping has a lot of overhead and that there is an actual risk of those escaping sequences being present in today's files. Not the ones on UNIX, but the ones on Windows. It should work both ways.

 
 The application doing that just preserves the original byte 
 sequences, for its internal needs, but will not expose to 
 other applications or modules such invalid sequences without 
 the same risks: these other modules need their own strategy, 
 and their strategy could simply be rejecting invalid 
 sequences, assuming that all other valid sequences are 
 encoding valid codepoints (this is the risk you take with 
 your proposal to assign valid codepoints to invalid byte 
 sequences in a UTF-8 stream, and a module that would 
 implement your proposal would remove important security features).
Only applications that do use the new conversion need to worry about security issues. And only those of course, that security issues apply to in the first place. All other applications can and should treat those codepoints as letters. And convert them to UTF-8 just as any other valid codepoint. I may have suggested otherwise at some point in time, but this is my current position.

 Note also that once your proposal is implemented, all valid 
 codepoints become convertible across all UTFs, without notice 
 (this is the principle of UTF that they allow transparent 
 conversions between each other).
Existing conversion is not modified. I am explaining how an alternate conversion works simply to prove it is useful. And it does not convert to UTF-8. It converts to byte sequences. And can be used in places where interfacing with such data. For example UNIX filenames. And 'supposedly UTF-8' is not the only case. The same technique can be used on 'supposedly Latin 3' data. The new conversions are used in pairs and existing UTF conversions remain as they are. Any security issues are up to whoever decides to use the new conversions. There are no security issues for those that do not.

 
 Suppose that your proposal is accepted, and that invalid 
 bytes 0xnn in UTF-8 sources (these bytes are necessarily 
 between 0x80 and 0xFF) get encoded to some valid code units 
 U+0mmmnn (in a new range U+mmm80 to U+mmmFF), then they 
 become immediately and transparently convertible to valid 
 UTF-16 or even valid UTF-8. Your assumption that the byte 
 sequence will be preserved will be wrong, because each 
 encoded binary byte will become valid sequences of 3 or 4 
 UTF-8 bytes (one lead byte in 0xE0..EF if code points are in 
 the BMP, or in 0xF0..0xF7 if they are in a supplementary 
 plane, and 2 or 3 trail bytes in 0x80..0xBF).
Again, a UTF-8 to UTF-16 converter does not need to (and should not) encode the invalid sequences as valid codepoints. Existing rules apply. Signal, reject, replace with U+FFFD.

 
 How do you think that other applications will treat these 
 sequences: they won't notice that they are originally 
 equivalent to the new valid sequences, and the byte sequence 
 itself would be transmitted across modules without any 
 warning (applications most often don't check whether 
 codepoints are assigned, just that they are valid and 
 properly encoded).
Exactly. This is why nothing breaks. And a Unicode application should treat the new codepoints exactly the same way it treats them today. Today they are unassigned and are converted according to existing rules. Once they are assigned, they just get some

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Kenneth Whistler wrote:
 Lars Kristan stated:
 
  I said, the choice is yours. My proposal does not prevent 
 you from doing it
  your way. You don't need to change anything and it will 
 still work the way
  it worked before. OK? I just want 128 codepoints so I can 
 make my own
  choice. 
 
 You have them: U+EE80..U+EEFF, which are yours to use (or abuse) 
 in an application as you see fit. Just don't expect others outside
 your application to interpret them as you do.


Well, I DO want someone to interpret them the way I do. And display them. And let them be entered. And not risk a clash with someone else, we are talking about PUA, right?

 
  And once and for all, you can treat those 128 codepoints just as you
  do today.
 
 A number of people on the list have patiently explained why what
 you are proposing to do fundamentally breaks UTF-8 and its
 relationship to other Unicode encoding forms.


It does not. I may have suggested at some point that the conversion from codepoints to UTF-8 should be changed. But I am no longer proposing that. The conversion to and from UTF-8 remains EXACTLY as it is today. I will use my own conversion as I see fit and deal with all the consequences. But I need 128 VALID codepoints. Not in PUA, not in any plane, but in BMP. And just because I say 'I' need, does not mean I am the only one.

One would judge who is right and who is not by the number of responses. But that is definitely not so. A couple of people keep responding and they have more or less the same theme. Which is because it has been rehearsed time and time again. I believe there are people who have long since realized that my claims are correct. But are just afraid to speak up. Also, wherever I win an argument, it is just dropped. In the end all that remains is a 'feeling' by a few people that 'this is not good'.

 
 The chances that you will get the standard extended to incorporate
 these 128 code points and define their mapping to invalid byte
 values in UTF-8 is somewhere between zilch, nada, and nil.


No, not UTF-8. UTF-8 remains as it is. What I will do with them is my business. I am only telling you about it so you cannot dismiss it as 'encapsulating arbitrary binary data in Unicode'.


Lars





RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





 Ken is absolutely right. It would be theoretically possible 
 to add 128 code
 points that would allow one to roundtrip a bytestream after 
 passing through
 a UTF-8 <=> UTF-32 conversion. (For that matter, it would be 
 possible to add
 2048 code points that would allow the same for a 16-bit data stream.)
You don't really need to add anything for 16-bit <=> UTF-32. There is no real-life need to have that roundtrip guaranteed. For 8-bit data there is real-life need. And even, for 16-bit <=> UTF-32 you can do it simply by defining how surrogates should be processed. Not saying it should be done, but showing it could be done. But for UTF-8 <=> UTF-32 it cannot be done without 128 new codepoints. Which is why I am often comparing these 128 codepoints to the surrogates. With one difference, they should be valid characters.
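What the 16-bit <=> UTF-32 case amounts to is easy to sketch (illustrative only, and deliberately not a sanctioned UTF: lone surrogates are simply passed through, which is exactly the "defining how surrogates should be processed" above):

#include <stdint.h>
#include <stddef.h>

/* 16-bit units -> 32-bit words: real surrogate pairs are combined,
 * lone surrogates are passed through as their own values.  The result is
 * NOT UTF-32, but every 16-bit stream survives the roundtrip.
 * dst needs room for at most n words. */
size_t words16_to_words32(const uint16_t *src, size_t n, uint32_t *dst)
{
    size_t i = 0, w = 0;
    while (i < n) {
        uint32_t u = src[i++];
        if (u >= 0xD800 && u <= 0xDBFF && i < n &&
            src[i] >= 0xDC00 && src[i] <= 0xDFFF)
            u = 0x10000 + ((u - 0xD800) << 10) + (src[i++] - 0xDC00);
        dst[w++] = u;
    }
    return w;
}

/* 32-bit words -> 16-bit units: the exact inverse of the above.
 * dst needs room for at most 2 * n units. */
size_t words32_to_words16(const uint32_t *src, size_t n, uint16_t *dst)
{
    size_t i, w = 0;
    for (i = 0; i < n; i++) {
        uint32_t u = src[i];
        if (u < 0x10000) {
            dst[w++] = (uint16_t)u;            /* includes lone surrogates, unchanged */
        } else {
            u -= 0x10000;
            dst[w++] = (uint16_t)(0xD800 + (u >> 10));
            dst[w++] = (uint16_t)(0xDC00 + (u & 0x3FF));
        }
    }
    return w;
}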

 
 However, these new code points would really be no better than 
 private use
 code points, since their interpretation would depend entirely 
Oh yes they would. Anyone might be using those same codepoints in PUA for something completely different.


 on whatever
 was assumed to be the interpretation of the original bytestream. If X
 converted a bytestream that was assumed to be a mixture of 
 8858-7 with UTF-8
 into Unicode with these new characters, and handed it off to Y, who
 converted the bytestream back assuming that the odd bytes were to be
 iso-8859-9, you would get data corruption. X and Y would have 
Nope. No data corruption. You just get the odd bytes back. And achieve exactly the same as if X passed the data directly to Y. Y doesn't convert from UTF-8 to iso-8859-9, nor does it convert the odd bytes to iso-8859-9. It converts UTF-8 to the original byte stream and ONLY THEN interprets it as iso-8859-9. So, the same as if it got the data directly.


Lars





Re: Roundtripping in Unicode

2004-12-13 Thread Arcane Jill
If I have understood this correctly, filenames are not in a locale, they 
are absolute. Users, on the other hand, are in a locale, and users view 
filenames. The same filename can look different to two different users. To 
user A (whose locale is Latin-1), a filename might look valid; to user B 
(whose locale is UTF-8), the same filename might look invalid.

Is that right, Lars?
If so, Marcin, what exactly is the error, and whose fault is it?
Jill
-Original Message-
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Behalf Of Marcin 'Qrczak' Kowalczyk
Sent: 13 December 2004 14:59
To: [EMAIL PROTECTED]
Subject: Re: Roundtripping in Unicode
Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.




Re: RE: Roundtripping in Unicode

2004-12-13 Thread John Cowan
Doug Ewell scripsit:

 When faced with [an] ill-formed code unit sequence while transforming
 or interpreting text, a conformant process must treat the first code
 unit... as an illegally terminated code unit sequence -- for example, by
 signaling an error, filtering the code unit out, or representing the
 code unit with a marker such as U+FFFD REPLACEMENT CHARACTER.

Plan 9, the original all-UTF-8 environment (it was translated
in a single day from Latin-1 to UTF-8), represents ill-formed code unit
sequences with the otherwise useless U+0080, on the grounds that an
ill-formed code is semantically different from an untranslatable
character, which is the purpose of U+FFFD.

-- 
LEAR: Dost thou call me fool, boy?  John Cowan
FOOL: All thy other titles  http://www.ccil.org/~cowan
 thou hast given away:  [EMAIL PROTECTED]
  That thou wast born with. http://www.reutershealth.com



RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Peter Kirk wrote:


 Now no doubt many Unix filename handling utilities ignore the 
 fact that 
 some octets are invalid or uninterpretable in the locale, 
 because they 
 handle filenames as octet strings (with 0x00 and 0x2F having special 
 interpretations) rather than as locale-dependent character 
 strings. But 
 these routines should continue to work in a UTF-8 locale, as 
 they make 
 no attempt to interpret any octets other than 0x00 and 0x2F.


Hm, here lies the catch. According to UTC, you need to keep processing the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8 function is allowed to reject invalid sequences. Basically, you are not supposed to use strcpy to process filenames.

Well, I just hope no one will listen to them and modify strcpy and strchr to validate the data when running in UTF-8 locale and start signalling something (really, where and how?!). The two statements from UTC don't make sense when put together. Unless we are really expected to start building everything from scratch.


 All of this is ingenious, and may be useful for internal processing 
 within a Unix system, and perhaps even for interaction between 
 cooperating systems. But NOT-Unicode is not Unicode (!) and 
 so Unicode 
 should not be expected to standardise it.
Not by definition. But if it would help the users since it would simplify the transition, then why not?



Lars





RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 UTF-8 is painful to process in the first place. You are making it
 even harder by demanding that all functions which process UTF-8 do
 something sensible for bytes which don't form valid UTF-8. They even
 can't temporarily convert it to UTF-32 for internal processing for
 convenience.
My point exactly. I am proposing to provide a conversion so you can. All you need is to assign 128 codepoints and define their properties. They would be printable characters, non-spaces, would have no upper/lower case properties, would collate (for example) after all letters but before any special characters, and so on. Then you don't need to fix anything. Not in the functions. You just need to convert (and even convert from byte stream to UTF-8) on boundaries where you expect such data. And decide whether you need to prevent anything due to security reasons. If not, then you're done.

So, no, I am not demanding that UTF-8 functions need to behave differently. Existing functions work perfectly well, assuming you convert to UTF-8 (so, use three bytes to represent each invalid byte as a valid codepoint). It would be beneficial if they would, but that is a separate issue. It would need to be determined which functions could do so. Maybe all could, maybe only some could, maybe none should. It needs to be investigated before anything is changed. This is in line with what I said about validation. Processing functions may do validation implicitly. But this is not a requirement. Unless you make it so. But in my opinion, it is better to separate validation from processing. In that case you can even prescribe exactly what they should do with invalid data. And in this case they should do exactly what they would do if the data was converted to UTF-8 according to my conversion. But again, this is the next step, that needn't be done at all.
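As a concrete sketch of the "three bytes per invalid byte" conversion described here (illustrative only: the 128 codepoints are not assigned by Unicode, so the base value below is pure assumption, borrowed from the U+EE80..U+EEFF private-use range mentioned elsewhere in this thread):

#include <stdint.h>
#include <stddef.h>

/* HYPOTHETICAL: the 128 "raw byte" codepoints are not assigned by Unicode.
 * U+EE80..U+EEFF (private use) is used here only because it was mentioned
 * in this thread as the space that is already available today. */
#define RAW_BYTE_BASE 0xEE80u

/* Emit one byte from the 0x80..0xFF range (bytes below 0x80 are plain ASCII
 * and never need escaping) as the three-byte UTF-8 form of the corresponding
 * "raw byte" codepoint.  Returns the number of bytes written (always 3). */
size_t encode_raw_byte(unsigned char b, unsigned char dst[3])
{
    uint32_t cp = RAW_BYTE_BASE + (b - 0x80u);   /* e.g. 0xA9 -> U+EEA9 */
    dst[0] = (unsigned char)(0xE0 |  (cp >> 12));
    dst[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    dst[2] = (unsigned char)(0x80 |  (cp & 0x3F));
    return 3;
}

Any processing function that already handles valid UTF-8 then handles these three-byte sequences as ordinary characters; only code that chooses to map them back to raw bytes ever treats them specially.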

 
  Listing files in a directory should not signal anything. It MUST
  return all files and it should also return them in a way that this
  list can be used to access each of the files.
 
 Which implies that they can't be interpreted as UTF-8.
 
 By masking an error you are not encouraging users to fix it.
 Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
Failure to process such files is also an error. Think virus scanners and backup.


  The interesting thing is that if you do start using my conversion,
  you can actually get rid of the need to validate UTF-8 strings
  in the first scenario. That of course means you will allow users
  with invalid UTF-8 sequences, but if one determines that this is
  acceptable (or even desired), then it makes things easier. But the
  choice is yours.
 
 For me it's not acceptable, so I will not support declaring it valid.
I said, the choice is yours. My proposal does not prevent you from doing it your way. You don't need to change anything and it will still work the way it worked before. OK? I just want 128 codepoints so I can make my own choice. And once and for all, you can treat those 128 codepoints just as you do today.


Lars





Re: RE: RE: Roundtripping in Unicode

2004-12-13 Thread Philippe VERDY
 From : Lars Kristan 
 Philippe VERDY wrote: 
  If a source sequence is invalid, and you want to preserve it, 
  then this sequence must remain invalid if you change its encoding. 
  So there's no need for Unicode to assign valid code points 
  for invalid source data. 
 Using invalid UTF-16 sequences to represent invalid UTF-8 sequences is a 
 known approach (UTF-8B, if I remember correctly). But this is then not UTF-16 
 data so you don't gain much. The data is at risk of being rejected or filtered 
 out at any time. And that misses the whole point.

I don't think I miss the point. My suggested approach of performing roundtrip conversions between UTFs while keeping all invalid sequences invalid (for the standard UTFs) is much less risky than converting them to valid codepoints (and by consequence to valid code units, because all valid code points need valid code units in UTF encoding forms).

The application doing that just preserves the original byte sequences for its internal needs, but it cannot expose such invalid sequences to other applications or modules without the same risks: those other modules need their own strategy, and their strategy could simply be to reject invalid sequences, assuming that all valid sequences encode valid codepoints (this is the risk you take with your proposal to assign valid codepoints to invalid byte sequences in a UTF-8 stream, and a module that implemented your proposal would remove important security features).

Note also that once your proposal is implemented, all valid codepoints become 
convertible across all UTFs, without notice (this is the principle of UTF that 
they allow transparent conversions between each other).

Suppose that your proposal is accepted, and that invalid bytes 0xnn in UTF-8 
sources (these bytes are necessarily between 0x80 and 0xFF) get encoded to some 
valid code units U+0mmmnn (in a new range U+mmm80 to U+mmmFF), then they become 
immediately and transparently convertible to valid UTF-16 or even valid UTF-8. 
Your assumption that the byte sequence will be preserved will be wrong, because 
each encoded binary byte will become a valid sequence of 3 or 4 UTF-8 bytes (one 
lead byte in 0xE0..0xEF if code points are in the BMP, or in 0xF0..0xF7 if they 
are in a supplementary plane, and 2 or 3 trail bytes in 0x80..0xBF).

How do you think other applications will treat these sequences? They won't notice that they are originally equivalent to the new valid sequences, and the byte sequence itself would be transmitted across modules without any warning (applications most often don't check whether codepoints are assigned, just that they are valid and properly encoded).

Which application will take the responsibility of converting these valid 3-4 byte sequences back to invalid 1-byte sequences, given that your data will already be treated by them as valid, and already encoded with valid UTF code units or encoding schemes?

Come back to your filesystem problem. Suppose that there ARE filenames that already contain these valid 3-4 byte sequences. This hypothetical application will blindly convert the valid 3-4 byte sequences to invalid 1-byte sequences, and then won't be able to access these files, even though they were already correctly UTF-8 encoded. So your proposal breaks valid UTF-8 encoding of filenames. In addition it creates dangerous aliases that will redirect accesses from one filename to another (so yes, it is also a security problem).

My opinion is then that we must not allow the conversion of any invalid byte sequence to valid code points. All your application can do is convert them to invalid sequences of code units, to preserve the invalid status. Then it's up to that application to make this conversion privately and to restore the original byte sequence before communicating again with the external system. Another process or module can do the same if it wishes to, but none of them should communicate directly with each other using their private code unit sequences. The decision to accept invalid byte sequences must remain local to each module and is not transmissible.

This means that permanent files containing invalid byte sequences must not be converted to another UTF and replaced as long as they contain an invalid byte sequence. Such a file converter should fail, and warn the user about file contents or filenames that could not be converted. Then it's up to the user to decide whether to:
- drop these files
- use a filter to remove invalid sequences (if it's a filename, the filter may 
need to append some indexing string to keep filenames unique in a directory)
- use a filter to replace some invalid sequences with a user-specified valid 
substitution string
- use a filter that will automatically generate valid substitution strings.
- use other programs that will accept and will be able to process invalid files 
as opaque sequences of bytes instead of as a stream of Unicode characters.
- change 

RE: Roundtripping in Unicode

2004-12-13 Thread Kenneth Whistler
Lars Kristan stated:

 I said, the choice is yours. My proposal does not prevent you from doing it
 your way. You don't need to change anything and it will still work the way
 it worked before. OK? I just want 128 codepoints so I can make my own
 choice. 

You have them: U+EE80..U+EEFF, which are yours to use (or abuse) 
in an application as you see fit. Just don't expect others outside
your application to interpret them as you do.

 And once and for all, you can treat those 128 codepoints just as you
 do today.

A number of people on the list have patiently explained why what
you are proposing to do fundamentally breaks UTF-8 and its
relationship to other Unicode encoding forms.

The chances that you will get the standard extended to incorporate
these 128 code points and define their mapping to invalid byte
values in UTF-8 is somewhere between zilch, nada, and nil.

--Ken



Re: RE: Roundtripping in Unicode

2004-12-13 Thread Doug Ewell

Philippe VERDY wrote:

 (In fact I also think that mapping invalid sequences to U+FFFD is also
 an error, because U+FFFD is valid, and the presence of the encoding
 error in the source is lost, and will not throw exceptions in further
 processings of the remapped text, unless the application constantly
 checks for the presence of U+FFFD in the text stream, and all modules
 in the application explicitly forbids U+FFFD within its interface...)

Mapping invalid sequences to U+FFFD is explicitly permitted by
conformance clause C12a (TUS 4.0, p. 61):

When faced with [an] ill-formed code unit sequence while transforming
or interpreting text, a conformant process must treat the first code
unit... as an illegally terminated code unit sequence -- for example, by
signaling an error, filtering the code unit out, or representing the
code unit with a marker such as U+FFFD REPLACEMENT CHARACTER.

Of course, any subsequent process that handles this text would have to
understand this convention, and not choke if handed a U+FFFD.
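
As a rough sketch of one of the options that clause permits (consume a single byte per error and emit U+FFFD; other error policies are equally conformant), a decoding step might look like this in C:

    #include <stdint.h>
    #include <stddef.h>

    /* Decode one code point from in[0..len-1] (len > 0).  On an ill-formed
     * sequence, store U+FFFD and consume exactly one byte. */
    static size_t decode_one(const unsigned char *in, size_t len, uint32_t *cp)
    {
        unsigned char b = in[0];
        size_t need, i;
        uint32_t c, min;

        if (b < 0x80)      { *cp = b; return 1; }              /* ASCII */
        else if (b < 0xC2) goto bad;                           /* stray trail byte or overlong lead */
        else if (b < 0xE0) { need = 1; c = b & 0x1F; min = 0x80;    }
        else if (b < 0xF0) { need = 2; c = b & 0x0F; min = 0x800;   }
        else if (b < 0xF5) { need = 3; c = b & 0x07; min = 0x10000; }
        else goto bad;                                         /* 0xF5..0xFF: never valid leads */

        if (len < need + 1) goto bad;                          /* truncated sequence */
        for (i = 1; i <= need; i++) {
            if ((in[i] & 0xC0) != 0x80) goto bad;              /* missing trail byte */
            c = (c << 6) | (in[i] & 0x3F);
        }
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            goto bad;                                          /* overlong, out of range, surrogate */
        *cp = c;
        return need + 1;

    bad:
        *cp = 0xFFFD;                                          /* REPLACEMENT CHARACTER */
        return 1;
    }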

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Roundtripping in Unicode

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 Please make up your mind: either they are valid and programs are
 required to accept them, or they are invalid and programs are required
 to reject them.
 
 I don't know what they should be called. The fact is there shouldn't be any.
 And that current software should treat them as valid. So, they are not valid
 but cannot (and must not) be validated. As stupid as it sounds. I am sure
 one of the standardizers will find a Unicodally correct way of putting it.

I am sure they will not.

There is a tension to migrate from processing strings in terms of
bytes in some vaguely specified encoding to processing them in terms
of code points of a known encoding, or even further: combining
character sequences, graphemes etc.

20 years ago the distinction was moot: a byte was a character, except
for some specialized programs for handling CJK. Today, when Latin names
with accented characters mixed with Cyrillic names are not displayed
correctly or not sorted according to the lexicographic conventions of some
culture, the program can be considered broken. Unfortunately supporting
this requires changing the paradigm. A font with 256 characters and a
byte-based rendering engine is not enough for display, and for
sorting it's no longer enough to compare a byte at a time.

You are trying to stick with processing byte sequences, carefully
preserving the storage format instead of preserving the meaning in
terms of Unicode characters. This leads to less robust software
which is not certain about the encoding of the text it processes and
thus can't apply algorithms like case mapping without risking
meaningless damage to the text.

 Today, two invalid UTF-8 strings compare the same in UTF-16, after a
 valid conversion (using a single replacement char, U+FFFD) and they
 compare different in their original form,

Conversion should signal an error by default. Replacing errors by
U+FFFD should be done only when the data is processed purely for
showing it to the user, without any further processing, i.e. when it's
better to show the text partially even if we know that it's corrupted.

 Either you do everything in UTF-8, or everything in UTF-16. Not
 always, but typically. If comparisons are not always done in the
 same UTF, then you need to validate. And not validate while
 converting, but validate on its own. And now many designers will
 remember that they didn't. So, all UTF-8 programs (of that kind)
 will need to be fixed. Well, might as well adopt my broken
 conversion and fix all UTF-16 programs. Again, of that kind, not all
 in general, so there are few. And even those would not be all
 affected. It would depend on which conversion is used where. Things
 could be worked out. Even if we would start changing all the
 conversions. Even more so if a new conversion is added and only used
 when specifically requested.

I don't understand anything of this.

 I cannot afford not to access the files.

Then you have two choices:
- Don't use Unicode.
- Pretend that filenames are encoded in ISO-8859-1, and represent them
  as a sequence of code points U+0001..U+00FF. They will not be displayed
  correctly but the information will be preserved.
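
A sketch of that second option, with the usual caveat that the helper names are only illustrative:

    #include <stdint.h>
    #include <stddef.h>

    /* Treat filename bytes as ISO-8859-1: byte b becomes code point U+00b.
     * Lossless in both directions; the names just may not display meaningfully. */
    static void bytes_to_latin1_cps(const unsigned char *name, size_t len, uint32_t *out) {
        for (size_t i = 0; i < len; i++)
            out[i] = name[i];                      /* U+0001..U+00FF */
    }

    static int latin1_cps_to_bytes(const uint32_t *cps, size_t len, unsigned char *out) {
        for (size_t i = 0; i < len; i++) {
            if (cps[i] == 0 || cps[i] > 0xFF)
                return -1;                         /* not representable as a filename byte */
            out[i] = (unsigned char)cps[i];
        }
        return 0;
    }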

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 It's essential that any UTF-n can be translated to any other without
 loss of data. Because it allows to use an implementation of the given
 functionality which represents data in any form, not necessarily the
 form we have at hand, as long as correctness is concerned. Avoiding
 conversion should matter only for efficiency, not for correctness.
 
 When I am talking about roundtrip, I speak of arbitrary data, not
 just valid data.

You want to declare all byte sequences as valid. And thus valid data
is no longer preserved on round trip, because different UTFs are able
to encode different sequences of code points.

 Roundtrip for valid data is of course essential and needs to be
 preserved.

Your proposal does not do this.

 Unpaired surrogates are not valid UTF-16, and there are no surrogates
 in UTF-8 at all, so there is no point in trying to preserve UTF-16
 which is not really UTF-16.
 
 Actually, there is a point. It is just that you fail to understand it.
 But then, you needn't worry about it, since it is outside of your area
 of interest.

I would worry if my programs would no longer accept what Unicode
considers valid UTF-n. And I would worry if rules defined by Unicode
would make one code point encodable as UTF-n, and another encodable too, but the
sequence of the two not encodable (because UTF-n would no longer
be usable as a format for serialization of arbitrary strings of valid
code points).

I would also worry if an API, file format or network protocol intended
for use by various programs required a non-standard variant of UTF-n,
because I couldn't use standard UTF-n encoding and decoding functions
to interoperate with it.

I indeed don't worry in what way you abuse UTF-n, as long as it's not
an official Unicode standard and it's not widely used in practice.

 If UTC takes 128 unassigned codepoints and declares them to be a new
 set of surrogates, you needn't worry either (your valid data will
 still convert to any UTF).

No, because it would remove the responsibility not to generate such data
and add the responsibility to accept it, and thus some programs which
are not currently broken would be broken under the changed rules.

 Unless you have a strict validator which already validates unpaired
 surrogates. But you don't. I am pretty sure about it.

I use system-supplied iconv() which does not accept anything which can
be described as unpaired surrogates.

 If a user encounters corrupt data and cannot process it with your
 program, she (she is 'politically correct', but in this case can
 be seen as sexism) will blame it on the program, not the data.

I don't care.

 This has been discussed mails back. UNIX filenames are already 'submitted'.
 Once you set your locale to UTF-8, you have labelled them all as UTF-8.
 Suggestions?

Convert them to be valid UTF-8 (as long as locales used in the system
use UTF-8 as the encoding, that is, otherwise keep them in the locale's
encoding).

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/



RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 Lars Kristan [EMAIL PROTECTED] writes:
 
  The other name for this is roundtripping. Currently, Unicode allows
 a roundtrip UTF-16 => UTF-8 => UTF-16. For any data. But there are
 several reasons why a UTF-8 => UTF-16(32) => UTF-8 roundtrip is more
  valuable, even if it means that the other roundtrip is no longer
  guaranteed:
 
 It's essential that any UTF-n can be translated to any other without
 loss of data. Because it allows to use an implementation of the given
 functionality which represents data in any form, not necessarily the
 form we have at hand, as long as correctness is concerned. Avoiding
 conversion should matter only for efficiency, not for correctness.
When I am talking about roundtrip, I speak of arbitrary data, not just valid data. Roundtrip for valid data is of course essential and needs to be preserved.

 
  Let me go a bit further. A UTF-16 => UTF-8 => UTF-16 roundtrip is only
  required for valid codepoints other than the surrogates. But it also
  works for surrogates unless you explicitly and 
 intentionally break it.
 
 Unpaired surrogates are not valid UTF-16, and there are no surrogates
 in UTF-8 at all, so there is no point in trying to preserve UTF-16
 which is not really UTF-16.
Actually, there is a point. It is just that you fail to understand it. But then, you needn't worry about it, since it is outside of your area of interest. So, as far as you are concerned, I can do with surrogates anything I like, right? If UTC takes 128 unassigned codepoints and declares them to be a new set of surrogates, you needn't worry either (your valid data will still convert to any UTF). Unless you have a strict validator which already validates unpaired surrogates. But you don't. I am pretty sure about it.

 
  I would opt for the latter (i.e. keep it working), according to my
  statement (in the thread When to validate) that validation should
  be separated from other processing, where possible.
 
 Surely it should be separated: validation is only necessary when data
 are passed from the external world to our system. Internal operations
 should not produce invalid data from valid data. You don't have to
 check at each point whether data is valid. You can assume that it is
 always valid, as long as the combination of the programming language,
 libraries and the program is not broken.
 
 Some languages make it easier to ensure that strings are valid, to the
 point that they guarantee it (they don't offer any way to construct
 an invalid string). Unfortunately many languages don't: they say that
 they represent strings in UTF-8 or UTF-16, but they are unsafe, they
 do nothing to prevent constructing an array of words which is not
 valid UTF-8 or UTF-16 and passing it to functions which assume that
 it is. Blame these languages, not the definitions of UTF-n.
Blaming solves nothing. In this case it is just a philosophical exercise. If a user encounters corrupt data and cannot process it with your program, she (she is 'politically correct', but in this case can be seen as sexism) will blame it on the program, not the data. The fact that your program conforms to the Unicode standard doesn't help you. Another program that doesn't, might work. If the user chooses to use this other program instead of yours, who will you blame?

 
  All this is known and presents no problems, or - only problems that
  can be kept under control. So, by introducing another set of 128
  'surrogates', we don't get a new type of a problem, just another
  instance of a well known one.
 
 Nonsense. UTF-8, UTF-16 and UTF-32 are interchangeable, and you would
 like to break this. No way.
Not in a way you would need to worry about. Did UTF-16 break UCS-2? No, because the codepoints that were assigned to surrogates were not used before. Same thing here.

  On top of it, I repeatedly stressed that it is UTF-8 data 
 that has the
  highest probablility of any of the following:
  * contains portions that are not UTF-8
  * is not really UTF-8, but user has UTF-8 set as default encoding
  * is not really UTF-8, but was marked as such
  * a transmission error not only changes data but also 
 creates invalid
  sequences
 
 In these cases the data is broken and the damage should be signalled as
 soon as possible, so the submitter can know this and correct it.
This has been discussed mails back. UNIX filenames are already 'submitted'. Once you set your locale to UTF-8, you have labelled them all as UTF-8. Suggestions?


Lars





RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:
 Lars Kristan [EMAIL PROTECTED] writes:
 
  All assigned codepoints do roundtrip even in my concept.
  But unassigned codepoints are not valid data.
 
 Please make up your mind: either they are valid and programs are
 required to accept them, or they are invalid and programs are required
 to reject them.
I don't know what they should be called. The fact is there shouldn't be any. And that current software should treat them as valid. So, they are not valid but cannot (and must not) be validated. As stupid as it sounds. I am sure one of the standardizers will find a Unicodally correct way of putting it.

 
  Furthermore, I was proposing this concept to be used, but not
  unconditionally. So, you can, possibly even should, keep using
  whatever you are using.
 
 So you prefer to make programs misbehave in unpredictable ways
 (when they pass the data from a component which uses relaxed rules
 to a component which uses strict rules) rather than have a clear and
 unambiguous notion of a valid UTF-8?
I am not particularly thrilled about it. In fact it should be discussed. Constructively. Simply assuming everything will break is not helpful. But if you want an answer, yes, I would go for it. Actually, there are fewer concerns involved than people think. Security is definitely an issue. But again, one shouldn't assume it breaks just like that. Let me risk a bold statement: security is typically implicitly centralized. And if comparison is always done in the same UTF, it won't break. The simple fact that two different UTF-16 strings compare equal in UTF-8 (after relaxed conversion) does not introduce a security issue. Today, two invalid UTF-8 strings compare the same in UTF-16 after a valid conversion (using a single replacement char, U+FFFD), and they compare different in their original form if you use strcmp. But you probably don't. Either you do everything in UTF-8, or everything in UTF-16. Not always, but typically. If comparisons are not always done in the same UTF, then you need to validate. And not validate while converting, but validate on its own. And now many designers will remember that they didn't. So, all UTF-8 programs (of that kind) will need to be fixed. Well, might as well adopt my broken conversion and fix all UTF-16 programs. Again, of that kind, not all in general, so there are few. And even those would not all be affected. It would depend on which conversion is used where. Things could be worked out. Even if we were to start changing all the conversions. Even more so if a new conversion is added and only used when specifically requested.
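
A tiny illustration of that comparison point (the UTF-16 arrays are written out by hand here, assuming a lenient converter that maps each invalid byte to U+FFFD):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Two different byte strings: "x" followed by a different invalid byte. */
        const char a[] = { 'x', (char)0xC0, 0 };
        const char b[] = { 'x', (char)0xFF, 0 };
        printf("bytes differ: %d\n", strcmp(a, b) != 0);             /* prints 1 */

        /* After a lenient conversion that replaces each invalid byte with
         * U+FFFD, both become the same UTF-16 sequence { 0x0078, 0xFFFD }. */
        const unsigned short ua[] = { 0x0078, 0xFFFD };
        const unsigned short ub[] = { 0x0078, 0xFFFD };
        printf("utf16 equal: %d\n", memcmp(ua, ub, sizeof ua) == 0); /* prints 1 */
        return 0;
    }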

There is a cost and there are risks. Nothing should be done hastily. But let's go back and ask ourselves what the benefits are. And evaluate the whole.

 
  Perhaps I can convert mine, but I cannot convert all filenames on
  a user's system.
 
 Then you can't access his files.
Yes, this is where it all started. I cannot afford not to access the files. I am not writing a notepad.


 
 With your proposal you couldn't as well, because you don't make them
 valid unconditionally. Some programs would access them and some would
 break, and it's not clear what should be fixed: programs or filenames.
It is important to have a way to write programs that can. And, there is definitely nothing to be fixed about the filenames. They are there and nobody will bother to change them. It is the programs that need to be fixed. And if Unicode needs to be fixed to allow that, then that is what is supposed to happen. Eventually.

Lars





Re: Roundtripping in Unicode

2004-12-11 Thread Doug Ewell
Lars Kristan wrote:

 All assigned codepoints do roundtrip even in my concept.
 But unassigned codepoints are not valid data.

 Please make up your mind: either they are valid and programs are
 required to accept them, or they are invalid and programs are
 required to reject them.

 I don't know what they should be called. The fact is there shouldn't
 be any. And that current software should treat them as valid. So, they
 are not valid but cannot (and must not) be validated. As stupid as it
 sounds. I am sure one of the standardizers will find a Unicodally
 correct way of putting it.

I can't even understand that paragraph, let alone paraphrase it.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/





Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
From: Doug Ewell [EMAIL PROTECTED]
Lars Kristan wrote:
I am sure one of the standardizers will find a Unicodally
correct way of putting it.
I can't even understand that paragraph, let alone paraphrase it.
My understanding of his question, and my response to his problem, is that you 
MUST not use VALID Unicode codepoints to represent INVALID byte sequences 
found in some text with an alleged UTF encoding.

The only way is to use INVALID codepoints, out of the Unicode space, and 
then design an encoding scheme that contains and extends the Unicode UTF, 
and make sure that there will be no possible interaction between such 
encoded binary data and encoded plain text (so the conversion between the 
encoding scheme of the byte stream and the encoding form with code units or 
codepoints in memory must be fully bijective; it is hard to design if you 
have to also support multiple UTF encoding schemes, because the invalid byte 
sequences of these UTF schemes are not the same, and must then be 
represented with distinct invalid codepoints or code units for each external 
UTF!)

I won't support the idea of reserving some valid codepoint in the Unicode 
space to allow storing something which is already considered invalid 
character data, notably because the Unicode standard is evolving, and such 
private encoding form which would work now could become incompatible with a 
later version of the Unicode standard, or a later standardized Unicode 
encoding scheme, meaning that interoperability would be lost...

The only thing for which you have a guarantee that Unicode will not assign a 
mandatory behavior is the codepoint space beyond U+10FFFF (I'm not sure about 
the permanent invalidity of some code unit spaces in UTF-8 and UTF-16 
encoding forms; also I'm not sure that there will be enough free space in 
later standard encoding forms or schemes, see for example SCSU or BOCU-1, or 
with other already used private encoding forms like the modified UTF-8 
extended encoding scheme defined by Sun in Java).




Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
My view about this problem of roundtripping is that if data, supposed to contain 
only valid UTF-8 sequences, contains some invalid byte sequences that still need 
to be roundtripped to some code point for internal management, and roundtripped 
later back to the original invalid byte sequence, then these invalid bytes MUST 
NOT be converted to valid code points.

An implementation based on an internal UTF-32 code unit representation could 
use, privately only, the range which is NOT assigned to valid Unicode code 
points; such an application would need to convert these bytes into code points 
higher than 0x10FFFF, but the same application will no longer be conforming to 
strict UTF-32 requirements: the application will in this way represent binary 
data which is NOT bound to Unicode rules and which can't be valid plain text. 
For example, a fixed offset above 0x10FFFF plus n, where n is the byte value to 
encapsulate. Don't call it UTF-32, because it MUST remain for private use only!
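
A sketch of what such an internal-only escaping could look like; the offset 0x110000 below is merely one possible choice of values above the Unicode range, not something the message specifies:

    #include <stdint.h>

    /* Internal use only: never emit these values as UTF-32. */
    static uint32_t escape_invalid_byte(unsigned char b) {
        return 0x110000u + b;                      /* above U+10FFFF, never a real code point */
    }

    static int is_escaped_byte(uint32_t u) {
        return u >= 0x110000u && u <= 0x1100FFu;
    }

    static unsigned char unescape_byte(uint32_t u) {
        return (unsigned char)(u - 0x110000u);
    }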

This will be more complex if the application uses UTF-16 code units, because 
there are only TWO code units that can be used to recognize such 
invalid-text data within a text stream. It is possible to do that, but with 
MUCH care:
For example, encode 0xFFFE before each byte value converted to some 16-bit 
code unit. The problem is that backward parsing of strings just checks that a 
code unit is a low surrogate, to see if a second backward step is needed to 
get the first high surrogate, and so U+FFFE would need to be used (privately 
only) as another lead high surrogate with special (internal) meaning for 
round trip compatibility, and so the best choice for the code unit encoding 
the invalid byte value would be to use a standard low surrogate to store 
this byte. So a qualifying internal representation would be {0xFFFE, 
0xDC00+n} where n is the byte value to encapsulate.
Don't call this UTF-16, because it is not UTF-16.
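
In code, the two-unit escape described above might be sketched like this (internal use only, as the message stresses; the helper names are illustrative):

    #include <stdint.h>

    /* An invalid byte n becomes the internal pair { 0xFFFE, 0xDC00 + n }. */
    static void escape_byte_16(unsigned char n, uint16_t out[2]) {
        out[0] = 0xFFFE;                           /* private lead marker */
        out[1] = (uint16_t)(0xDC00 | n);           /* low-surrogate value carrying the byte */
    }

    static int is_escaped_pair(const uint16_t p[2]) {
        return p[0] == 0xFFFE && (p[1] & 0xFF00) == 0xDC00;
    }

    static unsigned char unescape_byte_16(const uint16_t p[2]) {
        return (unsigned char)(p[1] & 0xFF);
    }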

An implementation that uses UTF-8 for valid strings could use the invalid 
ranges for lead bytes to encapsulate invalid byte values. Note however that the 
invalid bytes you would need to represent have 256 possible values, but the 
UTF-8 lead bytes have only 2 reserved values (0xC0 and 0xC1), each for 64 
codes, if you want to use an encoding on two bytes. The alternative would be 
to use the UTF-8 lead byte values which were initially assigned to byte 
sequences longer than 4 bytes, and that are now unassigned/invalid in 
standard UTF-8. For example: {0xF8+(n/64); 0x80+(n%64)}.
Here also it will be a private encoding, that should NOT be named UTF-8, and 
the application should clearly document that it will not only accept any 
valid Unicode string, but also some invalid data which will have some 
roundtrip compatibility.
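
The two-byte escape above, sketched in C (again private use only; the helper names are illustrative):

    /* Byte n is stored as { 0xF8 + (n / 64), 0x80 + (n % 64) }, reusing lead
     * byte values 0xF8..0xFB that are invalid in standard UTF-8. */
    static void escape_byte_8(unsigned char n, unsigned char out[2]) {
        out[0] = (unsigned char)(0xF8 + (n / 64)); /* 0xF8..0xFB */
        out[1] = (unsigned char)(0x80 + (n % 64)); /* 0x80..0xBF */
    }

    static unsigned char unescape_byte_8(const unsigned char in[2]) {
        return (unsigned char)((in[0] - 0xF8) * 64 + (in[1] - 0x80));
    }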

So what is the problem: suppose that the application, internally, starts to 
generate strings containing any occurrences of such private sequences, then 
it will be possible for the application to generate on its output a byte 
stream that would NOT have roundtrip compatibility, back to the private 
representation. So roundtripping would only be guaranteed for streams 
converted FROM a UTF-8 source where some invalid sequences are present and must be 
preserved by the internal representation. So the transformation is not 
bijective as you would think, and this potentially creates lots of possible 
security issues.

So for such an application, it would be much more appropriate to use different 
datatypes and structures to represent either streams of binary bytes or 
streams of characters, and to recognize them independently. The need for a 
bijective representation means that the input stream will contain an 
encapsulation to indicate *exactly* whether the stream is text or binary.

If the application is a filesystem storing filenames and there's no place in 
the filesystem to encode whether a filename is binary or text, then you are left 
without any secure solution!

So the best thing you can do to secure your application is to REJECT/IGNORE 
all files whose names do not match the strict UTF-8 encoding rules that your 
application expects (everything will happen as if those files were not present, but 
this may still create security problems if an application that does not see 
any file in a directory wants to delete that directory, assuming it is 
empty... In that case the application must be ready to accept the presence 
of directories without any content, and must not depend on the presence of a 
directory to determine that it has some contents; anyway, on secured 
filesystems, such things can happen due to access restrictions completely 
unrelated to the encoding of filenames, and it is not unreasonable to 
prepare the application so that it behaves correctly when faced with 
inaccessible files or directories, so that it will also 
correctly handle the fact that the same filesystem may contain 
non-plain-text and inaccessible filenames).

Anyway, the exposed solutions above demonstrate 

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 The other name for this is roundtripping. Currently, Unicode allows
 a roundtrip UTF-16 => UTF-8 => UTF-16. For any data. But there are
 several reasons why a UTF-8 => UTF-16(32) => UTF-8 roundtrip is more
 valuable, even if it means that the other roundtrip is no longer
 guaranteed:

It's essential that any UTF-n can be translated to any other without
loss of data. Because it allows to use an implementation of the given
functionality which represents data in any form, not necessarily the
form we have at hand, as long as correctness is concerned. Avoiding
conversion should matter only for efficiency, not for correctness.

 Let me go a bit further. A UTF-16 => UTF-8 => UTF-16 roundtrip is only
 required for valid codepoints other than the surrogates. But it also
 works for surrogates unless you explicitly and intentionally break it.

Unpaired surrogates are not valid UTF-16, and there are no surrogates
in UTF-8 at all, so there is no point in trying to preserve UTF-16
which is not really UTF-16.

 I would opt for the latter (i.e. keep it working), according to my
 statement (in the thread When to validate) that validation should
 be separated from other processing, where possible.

Surely it should be separated: validation is only necessary when data
are passed from the external world to our system. Internal operations
should not produce invalid data from valid data. You don't have to
check at each point whether data is valid. You can assume that it is
always valid, as long as the combination of the programming language,
libraries and the program is not broken.

Some languages make it easier to ensure that strings are valid, to the
point that they guarantee it (they don't offer any way to construct
an invalid string). Unfortunately many languages don't: they say that
they represent strings in UTF-8 or UTF-16, but they are unsafe, they
do nothing to prevent constructing an array of words which is not
valid UTF-8 or UTF-16 and passing it to functions which assume that
it is. Blame these languages, not the definitions of UTF-n.

 A UTF-32 => UTF-8 => UTF-32 roundtrip is similar, except that 16-8-16 works even
 with concatenation, while 32-8-32 can be broken with concatenation.

It always works as long as the data was really UTF-32 in the first place.
A word with a value of 0xD800 is not UTF-32.

 All this is known and presents no problems, or - only problems that
 can be kept under control. So, by introducing another set of 128
 'surrogates', we don't get a new type of a problem, just another
 instance of a well known one.

Nonsense. UTF-8, UTF-16 and UTF-32 are interchangeable, and you would
like to break this. No way.

 On the other hand, UTF-8 => UTF-16 => UTF-8 as well as UTF-8 => UTF-32 => UTF-8
 can be both achieved, with no exceptions. This is something no other
 roundtrip can offer at the moment.

But they do! An isolated byte with the highest bit set is not UTF-8,
so there is no point in converting it to UTF-16 and back.

 On top of it, I repeatedly stressed that it is UTF-8 data that has the
 highest probablility of any of the following:
 * contains portions that are not UTF-8
 * is not really UTF-8, but user has UTF-8 set as default encoding
 * is not really UTF-8, but was marked as such
 * a transmission error not only changes data but also creates invalid
 sequences

In these cases the data is broken and the damage should be signalled as
soon as possible, so the submitter can know this and correct it.

Alternatively you keep the original byte sequence, but don't pretend
that it's UTF-8. Delete the erroneous UTF-8 label instead of changing
the data.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/


RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Title: RE: Roundtripping in Unicode





Marcin 'Qrczak' Kowalczyk wrote:


  Roundtrip for valid data is of course essential and needs to be
  preserved.
 
 Your proposal does not do this.
All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data. Perhaps it should be stated using some other words, but there shouldn't be any in your data. Right?

Furthermore, I was proposing this concept to be used, but not unconditionally. So, you can, possibly even should, keep using whatever you are using.

 
  If a user encounters corrupt data and cannot process it with your
  program, she (she is 'politically correct', but in this case can
  be seen as sexism) will blame it on the program, not the data.
 
 I don't care.
If you don't, then the guy trying to sell your program will. Eventually, you will, too.


 
  This has been discussed mails back. UNIX filenames are 
 already 'submitted'.
  Once you set your locale to UTF-8, you have labelled them 
 all as UTF-8.
  Suggestions?
 
 Convert them to be valid UTF-8 (as long as locales used in the system
 use UTF-8 as the encoding, that is, otherwise keep them in 
 the locale's
 encoding).
Perhaps I can convert mine, but I cannot convert all filenames on a user's system. Other suggestions?



Lars





Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan [EMAIL PROTECTED] writes:

 All assigned codepoints do roundtrip even in my concept.
 But unassigned codepoints are not valid data.

Please make up your mind: either they are valid and programs are
required to accept them, or they are invalid and programs are required
to reject them.

 Furthermore, I was proposing this concept to be used, but not
 unconditionally. So, you can, possibly even should, keep using
 whatever you are using.

So you prefer to make programs misbehave in unpredictable ways
(when they pass the data from a component which uses relaxed rules
to a component which uses strict rules) rather than have a clear and
unambiguous notion of a valid UTF-8?

 Perhaps I can convert mine, but I cannot convert all filenames on
 a user's system.

Then you can't access his files.

With your proposal you couldn't as well, because you don't make them
valid unconditionally. Some programs would access them and some would
break, and it's not clear what should be fixed: programs or filenames.

-- 
   __( Marcin Kowalczyk
   \__/   [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/