RE: Roundtripping in Unicode

2004-12-16 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > Yes, IMHO all general-purpose languages should support processing > arrays of bytes, in addition to Unicode strings. C is likely to retain the behavior of the str functions. Although, it puts a lot

Re: Roundtripping in Unicode

2004-12-15 Thread Mark Davis
low this thread in any detail.) -Mark ----- Original Message ----- From: Lars Kristan To: 'Mark Davis' ; Kenneth Whistler Cc: [EMAIL PROTECTED] Sent: Tuesday, December 14, 2004 03:30 Subject: RE: Roundtripping in Unicode > Ken is absolutely right. It would be theoretically possible

RE: Roundtripping in Unicode

2004-12-15 Thread Mike Ayers
> From: Peter Kirk [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, December 15, 2004 3:52 AM > But surely octets 0x80 to 0x9f are (at least mostly) invalid > in ISO 8859? They are in fact valid. However, because they are contro

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > If one application switches from standard UTF-8 to your modification, > and another application continues to use standard UTF-8, then the > ability to pass arbitrary Unicode strings between them by se

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > OK, strcpy does not need to interpret UTF-8. But strchr probably should. No. Its argument is a byte, even though it's passed as type int. By "byte" here I mean "C char value, which is an octet in virtually all modern C implementations; the C standard doe
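[Editorial illustration, not code from the thread, and written in Python rather than C.] The point about byte-oriented search can be sketched as follows: in UTF-8, every byte of a multi-byte sequence is 0x80 or higher, so a search for an ASCII byte such as 0x2F never matches inside a non-ASCII character. The path below is a made-up example.

```python
# Hypothetical path, held as UTF-8 bytes the way a C char array would hold it.
path = "données/日本語.txt".encode("utf-8")

# Byte-level search, analogous to strchr(path, '/'): no decoding is needed.
slash = path.find(b"/")

directory = path[:slash]
filename = path[slash + 1:]

# Both halves are still well-formed UTF-8 because the split fell on an ASCII byte.
print(directory.decode("utf-8"), filename.decode("utf-8"))
```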

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Kenneth Whistler wrote: > Lars said: > > > According to UTC, you need to keep processing > > the UNIX filenames as BINARY data. And, also according to > UTC, any UTF-8 > > function is allowed to reject invalid sequences

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
D. Starner wrote: > The only solution is (a) to use ASCII or (b) to make the > switch over as quick > and clean as possible. Anyone who wants to create new files > in UTF-8 and leave > their old files in the old encoding is ask

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Oops, correction: In response to Marcin 'Qrczak' Kowalczyk >> Question: should a new programming language which uses Unicode for >> string representation allow non-characters in strings? Argument for >> allowing them: o

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Arcane Jill wrote: > The obvious solution is for all Unix machines everywhere to > be using the > same locale - and it had better be UTF-8. But an instantaneous global > switch-over is never going to happen, so we see this gradual

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Philippe Verdy wrote: > I have not > found a solution to this problem, and I don't know if such > solution even > exists; if such solution exists, it should be quite complex...). I think it should be possible to mathematically prov

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > Unix makes is possible for /you/ to change /your/ locale - but by > your reasoning, this is an error, unless all other users do so > simultaneously. Not necessarily: you can change the locale as long as it uses the same default encoding. By "error" I m

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk replied: > "Arcane Jill" <[EMAIL PROTECTED]> writes: > > > If so, Marcin, what exactly is the error, and whose fault is it? > > It's an error to use locales with different encodin

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > Now, it is true that data from two applications using this technique can > become intermixed. But this is not something we should fear. On the > contrary, this is why I do want to standardize the approach. Because in most > cases what will happen is exact

RE: Roundtripping in Unicode

2004-12-15 Thread Arcane Jill
-----Original Message----- From: [EMAIL PROTECTED] On Behalf Of Philippe Verdy Sent: 14 December 2004 22:47 To: Marcin 'Qrczak' Kowalczyk Cc: [EMAIL PROTECTED] Subject: Re: Roundtripping in Unicode From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Ar

Re: Roundtripping in Unicode

2004-12-15 Thread Peter Kirk
On 15/12/2004 00:22, Mike Ayers wrote: > From: Peter Kirk [mailto:[EMAIL PROTECTED] > Sent: Tuesday, December 14, 2004 3:37 PM > Thanks for the clarification. Perhaps the bifurcation could > be better expressed as into "strings of characters as defined > by the locale" and "strings of non-null octe

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 -> > NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an > awkward way which would happen to exclude those su
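[Editorial illustration, not code from the thread.] The asymmetry quoted above can be reproduced with a different escaping scheme from the one proposed in this thread: Python's built-in "surrogateescape" error handler, which maps each undecodable byte 0x80..0xFF to a lone surrogate U+DC80..U+DCFF. It is used here only as a stand-in, on the assumption that any byte-escaping scheme of this kind behaves analogously. Bytes survive a trip through a string, but a string whose escape characters happen to spell a valid UTF-8 sequence does not survive a trip through bytes.

```python
# Direction 1: bytes -> str -> bytes. An invalid byte is escaped and restored.
raw = b"\xc3"                                    # a UTF-8 lead byte with nothing after it
as_text = raw.decode("utf-8", errors="surrogateescape")
bytes_back = as_text.encode("utf-8", errors="surrogateescape")

# Direction 2: str -> bytes -> str. Two escape characters that together
# spell the valid UTF-8 sequence C3 A9 come back as one ordinary character.
escaped = "\udcc3\udca9"
as_bytes = escaped.encode("utf-8", errors="surrogateescape")
text_back = as_bytes.decode("utf-8", errors="surrogateescape")

print(bytes_back == raw)       # bytes roundtrip holds
print(text_back == escaped)    # string roundtrip does not
```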

RE: Roundtripping in Unicode

2004-12-15 Thread D. Starner
"Arcane Jill" writes: > The obvious solution is for all Unix machines everywhere to be using the same > locale - and it > had better be UTF-8. But an instantaneous global switch-over is never going > to happen, so we see > this gradual switch-over ... and it is during this transition phase tha

Re: Roundtripping in Unicode

2004-12-14 Thread John Cowan
Mike Ayers scripsit: > I thought that URLs were specified to be in Unicode. Am I mistaken? You are. URLs are specified to be in *ASCII*. There is a %-encoding hack that allows you to represent random-octet filenames as ASCII. Some people (including me) think it's a good idea to use this
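[Editorial illustration, not code from the thread.] A minimal sketch of the %-encoding mentioned above, using Python's standard urllib.parse functions: arbitrary filename octets become a pure-ASCII string and can be recovered exactly. The filename is a made-up example containing a byte that is not valid UTF-8.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

# Hypothetical Latin-1 filename; 0xE9 on its own is not valid UTF-8.
filename = b"caf\xe9 menu.txt"

# Every octet outside the unreserved ASCII set is written as %XX.
encoded = quote_from_bytes(filename)

# Decoding yields the original octets, with no character set involved.
decoded = unquote_to_bytes(encoded)

print(encoded, decoded == filename)
```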

RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
> From: Peter Kirk [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, December 14, 2004 3:37 PM > Thanks for the clarification. Perhaps the bifurcation could > be better expressed as into "strings of characters as defined > by the locale"

RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
> From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED]] On Behalf Of Mike Ayers > Sent: Tuesday, December 14, 2004 3:29 PM > The rule is "No zero, no eight". "No zero, no forty seven". My ba

RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
> From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED]] On Behalf Of Philippe Verdy > Sent: Tuesday, December 14, 2004 2:47 PM > More simply, I think that it's an error to have the encoding > part of any locale... The system shoul

Re: Roundtripping in Unicode

2004-12-14 Thread Doug Ewell
> Unicode did not invent the notion of conformance to character > encoding standards. What is new about Unicode is that it has > *3* interoperable character encoding forms, not just one, and > all of them are unusual in some way, because they are designed > for a very, very large encoded character

Re: Roundtripping in Unicode

2004-12-14 Thread Philippe Verdy
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Lars Kristan <[EMAIL PROTECTED]> writes: Hm, here lies the catch. According to UTC, you need to keep processing the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8 function is allowed to reject invalid sequences. Basical

Re: Roundtripping in Unicode

2004-12-14 Thread Philippe Verdy
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Arcane Jill" <[EMAIL PROTECTED]> writes: If so, Marcin, what exactly is the error, and whose fault is it? It's an error to use locales with different encodings on the same system. More simply, I think that it's an error to have the encoding par

Re: Roundtripping in Unicode

2004-12-14 Thread Kenneth Whistler
Marcin Kowalczyk noted: > Unicode has the following property. Consider sequences of valid > Unicode characters: from the range U+0000..U+10FFFF, excluding > non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and > U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded > in any
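[Editorial illustration, not code from the thread.] The set of code points described above (U+0000..U+10FFFF minus surrogates and noncharacters) can be written down as a small predicate, together with a check that such a code point survives encoding in each UTF. The function names are hypothetical, chosen for this sketch only.

```python
def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_valid_character(cp: int) -> bool:
    return 0 <= cp <= 0x10FFFF and not is_surrogate(cp) and not is_noncharacter(cp)

def roundtrips_in_all_utfs(cp: int) -> bool:
    # Encode and decode one code point in UTF-8, UTF-16 and UTF-32.
    s = chr(cp)
    return all(s.encode(enc).decode(enc) == s
               for enc in ("utf-8", "utf-16-le", "utf-32-le"))

sample = [0x41, 0xE9, 0x20AC, 0x1F600]
print(all(is_valid_character(cp) and roundtrips_in_all_utfs(cp) for cp in sample))
```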

RE: Roundtripping in Unicode

2004-12-14 Thread Mike Ayers
> From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED]] On Behalf Of Peter Kirk > Sent: Tuesday, December 14, 2004 11:32 AM > This is a design flaw in Unix, or in how it is explained to > users. Well, Lars wrote "Basically, you

Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > If so, Marcin, what exactly is the error, and whose fault is it? It's an error to use locales with different encodings on the same system. -- __("< Marcin Kowalczyk \__/ [EMAIL PROTECTED] ^^ http://qrnik.knm.org.pl/~qrczak/

RE: Roundtripping in Unicode

2004-12-14 Thread Kenneth Whistler
Lars said: > According to UTC, you need to keep processing > the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8 > function is allowed to reject invalid sequences. Basically, you are not > supposed to use strcpy to process filenames. This is a very misleading set of statement

Re: Roundtripping in Unicode

2004-12-14 Thread Peter Kirk
On 14/12/2004 11:32, Arcane Jill wrote: I've been following this thread for a while, and I've pretty much got the hang of the issues here. To summarize: I haven't followed everything, but here is my 2 cents worth. I note that there is a real problem. I have had significant problems in Windows wi

Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > Hm, here lies the catch. According to UTC, you need to keep > processing the UNIX filenames as BINARY data. And, also according > to UTC, any UTF-8 function is allowed to reject invalid sequences. > Basically, you are not supposed to use strcpy to pro

Re: Roundtripping in Unicode

2004-12-14 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 -> > NOT-UTF-16 -> NOT-UTF-8 But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 -> NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an awkward way which would h

Re: Roundtripping in Unicode

2004-12-14 Thread Peter Kirk
On 14/12/2004 17:47, John Cowan wrote: Peter Kirk scripsit: I think the problem here is that a Unix filename is a string of octets, not of characters. And so it should not be converted into another encoding form as if it is characters; it should be processed at a quite different level of inte

Re: Roundtripping in Unicode

2004-12-14 Thread John Cowan
Peter Kirk scripsit: > I think the problem here is that a Unix filename is a string of octets, > not of characters. And so it should not be converted into another > encoding form as if it is characters; it should be processed at a quite > different level of interpretation. Unfortunately, that

UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)

2004-12-14 Thread Edward H. Trager
> > Is that right, Lars? > > If so, Marcin, what exactly is the error, and whose fault is it? > > Jill > > -Original Message- > > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] > > Behalf Of Marcin 'Qrczak' Kowalczyk > > Sent: 13 December 2004 14:59 > > To: [EMAIL PROTECTED] > > Subject: Re: Roundtripping in Unicode > > Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error. > > > > > >

RE: Roundtripping in Unicode

2004-12-14 Thread Lars Kristan
Arcane Jill wrote: > I've been following this thread for a while, and I've pretty Thanks for bearing with me. And I hope my response will not discourage you from continuing to do so. That is, until I am banned from the list for heresy.

RE: Roundtripping in Unicode

2004-12-14 Thread Arcane Jill
I've been following this thread for a while, and I've pretty much got the hang of the issues here. To summarize: Unix filenames consist of an arbitrary sequence of octets, excluding 0x00 and 0x2F. How they are /displayed/ to any given user depends on that user's locale setting. In this scenario
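[Editorial illustration, not code from the thread.] The summary above can be sketched in a few lines: a filename is any non-empty octet sequence without 0x00 or 0x2F, and what a user sees depends on which encoding their locale applies to those octets. The function name and the sample filename are made up for this sketch.

```python
def is_valid_unix_filename(name: bytes) -> bool:
    # Any non-empty octet sequence is allowed, except NUL (0x00) and '/' (0x2F).
    return len(name) > 0 and b"\x00" not in name and b"/" not in name

# The same stored octets...
stored = b"caf\xc3\xa9"

# ...displayed by a user in a UTF-8 locale and by a user in a Latin-1 locale.
seen_in_utf8_locale = stored.decode("utf-8")
seen_in_latin1_locale = stored.decode("latin-1")

print(is_valid_unix_filename(stored), seen_in_utf8_locale, seen_in_latin1_locale)
```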

Re: RE: Roundtripping in Unicode

2004-12-13 Thread Doug Ewell
Philippe VERDY wrote: > (In fact I also think that mapping invalid sequences to U+FFFD is also > an error, because U+FFFD is valid, and the presence of the encoding > error in the source is lost, and will not throw exceptions in further > processing of the remapped text, unless the application c
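[Editorial illustration, not code from the thread.] The information loss described above is easy to observe with Python's standard "replace" error handler, which substitutes U+FFFD for ill-formed input: two different byte strings decode to the same text, so the original bytes cannot be recovered and later processing sees only valid characters.

```python
# Two different inputs, each containing one byte that is not valid UTF-8 here.
first = b"a\xe9b"
second = b"a\xffb"

first_text = first.decode("utf-8", errors="replace")
second_text = second.decode("utf-8", errors="replace")

# Both invalid bytes collapse to the same replacement character.
print(first_text == second_text, first != second)

# Re-encoding yields the UTF-8 form of U+FFFD, not the original byte.
reencoded = first_text.encode("utf-8")
```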

Re: Roundtripping in Unicode

2004-12-13 Thread Philippe Verdy
nvalid sequences of 8-bit or 32-bit code units. ----- Original Message ----- From: "Mark Davis" <[EMAIL PROTECTED]> To: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Monday, December 13, 2004 11:04 PM Subject: Re

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Peter Kirk wrote: > Now no doubt many Unix filename handling utilities ignore the > fact that > some octets are invalid or uninterpretable in the locale, > because they > handle filenames as octet strings (with 0x00 and 0x2F

Re: RE: Roundtripping in Unicode

2004-12-13 Thread John Cowan
Doug Ewell scripsit: > "When faced with [an] ill-formed code unit sequence while transforming > or interpreting text, a conformant process must treat the first code > unit... as an illegally terminated code unit sequence -- for example, by > signaling an error, filtering the code unit out, or repr

Re: Roundtripping in Unicode

2004-12-13 Thread Arcane Jill
' Kowalczyk Sent: 13 December 2004 14:59 To: [EMAIL PROTECTED] Subject: Re: Roundtripping in Unicode Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.

Re: Roundtripping in Unicode

2004-12-13 Thread Mark Davis
ferent than private use (where they also have to agree on the interpretation). -Mark ----- Original Message ----- From: "Kenneth Whistler" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Cc: <[EMAIL PROTECTED]> Sent: Monday, December 13, 2004 13:04 Subject: RE: Roundtr

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
> Ken is absolutely right. It would be theoretically possible > to add 128 code > points that would allow one to roundtrip a bytestream after > passing through > a UTF-8 <=> UTF-32 conversion. (For that matter, it would be > pos

RE: Roundtripping in Unicode

2004-12-13 Thread Kenneth Whistler
Lars Kristan stated: > I said, the choice is yours. My proposal does not prevent you from doing it > your way. You don't need to change anything and it will still work the way > it worked before. OK? I just want 128 codepoints so I can make my own > choice. You have them: U+EE80..U+EEFF, which a
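[Editorial illustration, not code from the thread.] A minimal sketch of what using the private-use range U+EE80..U+EEFF for this purpose could look like. The mapping chosen here (byte b becomes code point U+EE00 + b) and the names "pua_escape", decode_lossless and encode_lossless are assumptions made for the sketch; the thread does not spell out an exact mapping. It relies on Python's codecs.register_error, and on the fact that only bytes 0x80..0xFF are ever reported as undecodable by the UTF-8 decoder.

```python
import codecs

PUA_BASE = 0xEE00  # assumed mapping: invalid byte b -> U+EE00 + b (U+EE80..U+EEFF)

def _escape_to_pua(exc):
    # Error handler: replace each undecodable byte with one private-use character.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end

codecs.register_error("pua_escape", _escape_to_pua)

def decode_lossless(raw: bytes) -> str:
    return raw.decode("utf-8", errors="pua_escape")

def encode_lossless(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - PUA_BASE)      # restore the original byte
        else:
            out += ch.encode("utf-8")
    return bytes(out)

raw = b"caf\xe9.txt"                       # Latin-1 bytes, not valid UTF-8
print(encode_lossless(decode_lossless(raw)) == raw)
```

Note that text which already contains U+EE80..U+EEFF for some other private purpose would be turned into raw bytes by encode_lossless, which is the kind of collision objected to elsewhere in this thread.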

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Kenneth Whistler wrote: > Lars Kristan stated: > > > I said, the choice is yours. My proposal does not prevent > you from doing it > > your way. You don't need to change anything and it will > still work the way > > it

RE: RE: RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Philippe VERDY wrote: > I don't think I miss the point. My suggested approach to > perform roundtrip conversions between UTF's while keeping all > invalid sequences as invalid (for the standard UTFs), is much > less risky t

Re: RE: Roundtripping in Unicode

2004-12-13 Thread Philippe VERDY
Lars Kristan wrote: > What I was talking about in the paragraph in question is what happens if you want to take unassigned codepoints and give them a new status. You don't need to do that. No Unicode application must assign semantics to unassigned codepoints. If a source sequence is invalid, and you

Re: RE: RE: Roundtripping in Unicode

2004-12-13 Thread Philippe VERDY
> From : "Lars Kristan" > Philippe VERDY wrote: > > If a source sequence is invalid, and you want to preserve it, > > then this sequence must remain invalid if you change its encoding. > > So there's no need for Unicode to assign valid code points > > for invalid source data. > Using invalid

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > UTF-8 is painful to process in the first place. You are making it > even harder by demanding that all functions which process UTF-8 do > something sensible for bytes which don't form valid

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > And once we understand that things are manageable and not as > frightening as it seems at first, then we can stop using this as an > argument against introducing 128 codepoints. People who will find > them useful should and will bother with the consequence

RE: RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Philippe VERDY wrote: > If a source sequence is invalid, and you want to preserve it, > then this sequence must remain invalid if you change its encoding. > So there's no need for Unicode to assign valid code points > for invalid s

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Philippe Verdy wrote: > An implementation that uses UTF-8 for valid string could use > the invalid > ranges for lead bytes to encapsulate invalid byte values. > Note however that > invalid bytes you would need to represent have 256 pos

Re: Roundtripping in Unicode

2004-12-13 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > But, as I once already said, you can do it with UTF-8, you simply > keep the invalid sequences as they are, and really handle them > differently only when you actually process them or display them. UTF-8 is painful to process in the first place. You are

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Philippe Verdy wrote: > From: "Doug Ewell" <[EMAIL PROTECTED]> > > Lars Kristan wrote: > >> I am sure one of the standardizers will find a Unicodally > >> correct way of putting it. > > > > I can'

RE: Roundtripping in Unicode

2004-12-13 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > You are trying to stick with processing byte sequences, carefully > preserving the storage format instead of preserving the meaning in > terms of Unicode characters. This leads to less robust soft

Re: Roundtripping in Unicode

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: >> Please make up your mind: either they are valid and programs are >> required to accept them, or they are invalid and programs are required >> to reject them. > > I don't know what they should be called. The fact is there shouldn't be any. > And that cur

Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
My view about this "problem" of roundtripping is that if data, supposed to contain only valid UTF-8 sequences, contains some invalid byte sequences that still need to be roundtripped to some "code point" for internal management that can be roundtripped later to the o

Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
From: "Doug Ewell" <[EMAIL PROTECTED]> Lars Kristan wrote: I am sure one of the standardizers will find a Unicodally correct way of putting it. I can't even understand that paragraph, let alone paraphrase it. My understanding of his question and my response to his problem is that you MUST not use V

Re: Roundtripping in Unicode

2004-12-11 Thread Doug Ewell
Lars Kristan wrote: >>> All assigned codepoints do roundtrip even in my concept. >>> But unassigned codepoints are not valid data. >> >> Please make up your mind: either they are valid and programs are >> required to accept them,

RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > Lars Kristan <[EMAIL PROTECTED]> writes: > > > All assigned codepoints do roundtrip even in my concept. > > But unassigned codepoints are not valid data. > > Please make up your

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > All assigned codepoints do roundtrip even in my concept. > But unassigned codepoints are not valid data. Please make up your mind: either they are valid and programs are required to accept them, or they are invalid and programs are required to reject the

RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > > Roundtrip for valid data is of course essential and needs to be > > preserved. > > Your proposal does not do this. All assigned codepoints do roundtrip even in my concept. But unassigned c

RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > Lars Kristan <[EMAIL PROTECTED]> writes: > > > The other name for this is roundtripping. Currently, Unicode allows > > a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: >> It's essential that any UTF-n can be translated to any other without >> loss of data. Because it allows to use an implementation of the given >> functionality which represents data in any form, not necessarily the >> form we have at hand, as long as corr

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > The other name for this is roundtripping. Currently, Unicode allows > a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are > several reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more > valuable, even if it means that the other roundtrip i