Re: Roundtripping Solved

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Peter Kirk <[EMAIL PROTECTED]> writes: > Jill, again your solution is ingenious. But would it not work just > as well for Lars' purposes to use, instead of your string of > random characters, just ONE reserved code point followed by U+0xx? > Instead of asking the UTC to allocate a specific code

Re: Roundtripping Solved

2004-12-15 Thread Peter Kirk
On 15/12/2004 14:36, Arcane Jill wrote: Yes, but only if you can have some reasonable assurance that the byte sequence emitted by UTF(c,x) (where c is the single reserved codepoint you suggest, and x is U+00xx, the value to be escaped expressed as a character) will not occur in plain text. This

Re: Roundtripping in Unicode

2004-12-15 Thread Mark Davis
> Nope. No data corruption. You just get the odd bytes back. And achieve I see more of what you are trying to do; let me try to be more clear. Suppose that the conversion is defined in the following way, between Unicode strings (D29a-d, page 74) and UTFs using your proposed new characters, for now

RE: Roundtripping Solved

2004-12-15 Thread Lars Kristan
Arcane Jill wrote: > solution, again without breaking the Unicode model. If I have > It is for reasons of requirement (4) that Lars proposes the > introduction of > 128 BMP codepoints. His intention is that they be marked as > "reserved - do > not use"

Re: Roundtripping Solved

2004-12-15 Thread Doug Ewell
Arcane Jill wrote: > DEFINITION - "f" is a function which maps an arbitrary octet stream to > a sequence of Unicode characters, such that (1) any substring which > happens to be valid UTF-8 is mapped to the sequence of Unicode > characters which would have been produced by UTF-8, and (2) all > re

RE: Roundtripping in Unicode

2004-12-15 Thread Mike Ayers
> From: Peter Kirk [mailto:[EMAIL PROTECTED]] > Sent: Wednesday, December 15, 2004 3:52 AM > But surely octets 0x80 to 0x9f are (at least mostly) invalid > in ISO 8859? They are in fact valid. However, because they are control characters, the

Re: Roundtripping Solved

2004-12-15 Thread Arcane Jill
Yes, but only if you can have some reasonable assurance that the byte sequence emitted by UTF(c,x) (where c is the single reserved codepoint you suggest, and x is U+00xx, the value to be escaped expressed as a character) will not occur in plain text. This is theoretically checkable - the total

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > If one application switches from standard UTF-8 to your modification, > and another application continues to use standard UTF-8, then the > ability to pass arbitrary Unicode strings between them by serializing > them to UTF

Re: Roundtripping Solved

2004-12-15 Thread Doug Ewell
Marcin 'Qrczak' Kowalczyk wrote: >> OBSERVATION - Requirement (4) is not met absolutely, however, >> the probability of the UTF-8 encoding of this sequence occurring >> "accidentally" at an arbitrary offset in an arbitrary octet stream >> is approximately one in 2^384; > > Assuming that the distribu

Re: Roundtripping Solved

2004-12-15 Thread Peter Kirk
On 15/12/2004 11:11, Arcane Jill wrote: I followed (and understood) Lars' explanation as to why the NOT- solution wouldn't work for him. Shame really - but here's another bash at a solution, again without breaking the Unicode model. If I have understood this correctly, these are Lars' requir

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > OK, strcpy does not need to interpret UTF-8. But strchr probably should. No. Its argument is a byte, even though it's passed as type int. By "byte" here I mean "C char value, which is an octet in virtually all modern C implementations; the C standard doe

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Kenneth Whistler wrote: > Lars said: > > > According to UTC, you need to keep processing > > the UNIX filenames as BINARY data. And, also according to > UTC, any UTF-8 > > function is allowed to reject invalid sequences. Basically, > you are not > > suppo

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
D. Starner wrote: > The only solution is (a) to use ASCII or (b) to make the > switch over as quick > and clean as possible. Anyone who wants to create new files > in UTF-8 and leave > their old files in the old encoding is asking for trouble. > There's

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Oops, correction: In response to Marcin 'Qrczak' Kowalczyk >> Question: should a new programming language which uses Unicode for >> string representation allow non-characters in strings? Argument for >> allowing them: otherwise they are completely use

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Arcane Jill wrote: > The obvious solution is for all Unix machines everywhere to > be using the > same locale - and it had better be UTF-8. But an instantaneous global > switch-over is never going to happen, so we see this gradual > switch-over ... > an

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Philippe Verdy wrote: > I have not > found a solution to this problem, and I don't know if such > solution even > exists; if such solution exists, it should be quite complex...). I think it should be possible to mathematically prove that it doesn't exi

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > Unix makes it possible for /you/ to change /your/ locale - but by > your reasoning, this is an error, unless all other users do so > simultaneously. Not necessarily: you can change the locale as long as it uses the same default encoding. By "error" I m

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk replied: > "Arcane Jill" <[EMAIL PROTECTED]> writes: > > > If so, Marcin, what exactly is the error, and whose fault is it? > > It's an error to use locales with different encodings on the same > system. U, and whose fault i

Re: Roundtripping in Unicode

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > Now, it is true that data from two applications using this technique can > become intermixed. But this is not something we should fear. On the > contrary, this is why I do want to standardize the approach. Because in most > cases what will happen is exact

Re: Roundtripping Solved

2004-12-15 Thread Marcin 'Qrczak' Kowalczyk
"Arcane Jill" <[EMAIL PROTECTED]> writes: > OBSERVATION - Requirement (4) is not met absolutely, however, > the probability of the UTF-8 encoding of this sequence occurring > "accidentally" at an arbitrary offset in an arbitrary octet stream > is approximately one in 2^384; Assuming that the distrib

RE: Roundtripping in Unicode

2004-12-15 Thread Arcane Jill
-Original Message- From: [EMAIL PROTECTED] On Behalf Of Philippe Verdy Sent: 14 December 2004 22:47 To: Marcin 'Qrczak' Kowalczyk Cc: [EMAIL PROTECTED] Subject: Re: Roundtripping in Unicode From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Arcane Jill" <[EMAIL PROTECTED]> writes: If so

Re: Roundtripping in Unicode

2004-12-15 Thread Peter Kirk
On 15/12/2004 00:22, Mike Ayers wrote: > From: Peter Kirk [mailto:[EMAIL PROTECTED] > Sent: Tuesday, December 14, 2004 3:37 PM > Thanks for the clarification. Perhaps the bifurcation could > be better expressed as into "strings of characters as defined > by the locale" and "strings of non-null octe

RE: Roundtripping in Unicode

2004-12-15 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 -> > NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an > awkward way which would happen to exclude those subsequences of > non-characters which

RE: Roundtripping in Unicode

2004-12-15 Thread D. Starner
"Arcane Jill" writes: > The obvious solution is for all Unix machines everywhere to be using the same > locale - and it > had better be UTF-8. But an instantaneous global switch-over is never going > to happen, so we see > this gradual switch-over ... and it is during this transition phase tha

RE: UTF-8 vs. Non-UTF-8 Locales and File Names (WAS: Re: Roundtripping in Unicode)

2004-12-15 Thread Lars Kristan
Edward H. Trager wrote: > UTF-8's home directory). So both users could probably guess > the filename > they were looking at. Which, BTW, is true for most of Europe but is not true for some other combina

Roundtripping Solved

2004-12-15 Thread Arcane Jill
I followed (and understood) Lars' explanation as to why the NOT- solution wouldn't work for him. Shame really - but here's another bash at a solution, again without breaking the Unicode model. If I have understood this correctly, these are Lars' requirements: 1) There exists a function, f()