Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> Yes, IMHO all general-purpose languages should support processing
> arrays of bytes, in addition to Unicode strings.
C is likely to retain the behavior of the str functions. Although, it puts a lot
low this thread in any detail.)
-Mark
- Original Message -
From: Lars Kristan
To: 'Mark Davis' ; Kenneth Whistler
Cc: [EMAIL PROTECTED]
Sent: Tuesday, December 14, 2004 03:30
Subject: RE: Roundtripping in Unicode
> Ken is absolutely right. It would be theoretically possible
Title: RE: Roundtripping in Unicode
> From: Peter Kirk [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, December 15, 2004 3:52 AM
> But surely octets 0x80 to 0x9f are (at least mostly) invalid
> in ISO 8859?
They are in fact valid. However, because they are contro
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> If one application switches from standard UTF-8 to your modification,
> and another application continues to use standard UTF-8, then the
> ability to pass arbitrary Unicode strings between them by se
Lars Kristan <[EMAIL PROTECTED]> writes:
> OK, strcpy does not need to interpret UTF-8. But strchr probably should.
No. Its argument is a byte, even though it's passed as type int.
By "byte" here I mean "C char value, which is an octet in virtually
all modern C implementations; the C standard doe
Title: RE: Roundtripping in Unicode
Kenneth Whistler wrote:
> Lars said:
>
> > According to UTC, you need to keep processing
> > the UNIX filenames as BINARY data. And, also according to
> UTC, any UTF-8
> > function is allowed to reject invalid sequences
Title: RE: Roundtripping in Unicode
D. Starner wrote:
> The only solution is (a) to use ASCII or (b) to make the
> switch over as quick
> and clean as possible. Anyone who wants to create new files
> in UTF-8 and leave
> their old files in the old encoding is ask
Title: RE: Roundtripping in Unicode
Oops, correction:
In response to Marcin 'Qrczak' Kowalczyk
>> Question: should a new programming language which uses Unicode for
>> string representation allow non-characters in strings? Argument for
>> allowing them: o
Title: RE: Roundtripping in Unicode
Arcane Jill wrote:
> The obvious solution is for all Unix machines everywhere to
> be using the
> same locale - and it had better be UTF-8. But an instantaneous global
> switch-over is never going to happen, so we see this gradual
Title: RE: Roundtripping in Unicode
Philippe Verdy wrote:
> I have not
> found a solution to this problem, and I don't know if such
> solution even
> exists; if such solution exists, it should be quite complex...).
I think it should be possible to mathematically prov
"Arcane Jill" <[EMAIL PROTECTED]> writes:
> Unix makes it possible for /you/ to change /your/ locale - but by
> your reasoning, this is an error, unless all other users do so
> simultaneously.
Not necessarily: you can change the locale as long as it uses the same
default encoding.
By "error" I m
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk replied:
> "Arcane Jill" <[EMAIL PROTECTED]> writes:
>
> > If so, Marcin, what exactly is the error, and whose fault is it?
>
> It's an error to use locales with different encodin
Lars Kristan <[EMAIL PROTECTED]> writes:
> Now, it is true that data from two applications using this technique can
> become intermixed. But this is not something we should fear. On the
> contrary, this is why I do want to standardize the approach. Because in most
> cases what will happen is exact
-Original Message-
From: [EMAIL PROTECTED] On Behalf Of Philippe Verdy
Sent: 14 December 2004 22:47
To: Marcin 'Qrczak' Kowalczyk
Cc: [EMAIL PROTECTED]
Subject: Re: Roundtripping in Unicode
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
"Ar
On 15/12/2004 00:22, Mike Ayers wrote:
> From: Peter Kirk [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, December 14, 2004 3:37 PM
> Thanks for the clarification. Perhaps the bifurcation could
> be better expressed as into "strings of characters as defined
> by the locale" and "strings of non-null octe
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
> NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
> awkward way which would happen to exclude those su
"Arcane Jill" writes:
> The obvious solution is for all Unix machines everywhere to be using the same
> locale - and it
> had better be UTF-8. But an instantaneous global switch-over is never going
> to happen, so we see
> this gradual switch-over ... and it is during this transition phase tha
Mike Ayers scripsit:
> I thought that URLs were specified to be in Unicode. Am I mistaken?
You are. URLs are specified to be in *ASCII*. There is a %-encoding
hack that allows you to represent random-octet filenames as ASCII.
Some people (including me) think it's a good idea to use this
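
The %-encoding mentioned here can be tried out with Python's standard `urllib.parse` module. This is only a minimal sketch: the filename bytes are an invented example, and Python is used purely for illustration, not because the thread refers to it.

```python
from urllib.parse import quote, unquote_to_bytes

# Hypothetical filename: Latin-1 bytes that are not valid UTF-8.
raw_name = b"caf\xe9 menu.txt"

# Every octet outside the unreserved ASCII set becomes %XX,
# so the escaped form is plain ASCII regardless of the original encoding.
escaped = quote(raw_name)

# Unescaping gives back exactly the original octets.
restored = unquote_to_bytes(escaped)

print(escaped)               # caf%E9%20menu.txt
print(restored == raw_name)  # True
```

Because the escaping works octet by octet, it never needs to know which character encoding the filename was written in.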
Title: RE: Roundtripping in Unicode
> From: Peter Kirk [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, December 14, 2004 3:37 PM
> Thanks for the clarification. Perhaps the bifurcation could
> be better expressed as into "strings of characters as defined
> by the locale"
Title: RE: Roundtripping in Unicode
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Mike Ayers
> Sent: Tuesday, December 14, 2004 3:29 PM
> The rule is "No zero, no eight".
"No zero, no forty seven".
My ba
Title: RE: Roundtripping in Unicode
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Philippe Verdy
> Sent: Tuesday, December 14, 2004 2:47 PM
> More simply, I think that it's an error to have the encoding
> part of any locale... The system shoul
> Unicode did not invent the notion of conformance to character
> encoding standards. What is new about Unicode is that it has
> *3* interoperable character encoding forms, not just one, and
> all of them are unusual in some way, because they are designed
> for a very, very large encoded character
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
Lars Kristan <[EMAIL PROTECTED]> writes:
Hm, here lies the catch. According to UTC, you need to keep
processing the UNIX filenames as BINARY data. And, also according
to UTC, any UTF-8 function is allowed to reject invalid sequences.
Basical
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
"Arcane Jill" <[EMAIL PROTECTED]> writes:
If so, Marcin, what exactly is the error, and whose fault is it?
It's an error to use locales with different encodings on the same
system.
More simply, I think that it's an error to have the encoding par
Marcin Kowalczyk noted:
> Unicode has the following property. Consider sequences of valid
> Unicode characters: from the range U+0000..U+10FFFF, excluding
> non-characters (i.e. U+nFFFE and U+nFFFF for n from 0 to 0x10 and
> U+FDD0..U+FDEF) and surrogates. Any such sequence can be encoded
> in any
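
The property quoted above can be checked on a small scale. The sketch below is an illustration in Python, with an invented sample string and a hypothetical helper name. It passes a string through UTF-8, UTF-16 and UTF-32 and confirms that the final bytes match the first ones, then shows that a lone surrogate is rejected. Note one difference from the quoted wording: Python's codecs do encode noncharacters, so the sketch only demonstrates the surrogate exclusion.

```python
def reencode_through_all_utfs(text: str) -> bytes:
    """UTF-8 -> UTF-16 -> UTF-32 -> UTF-8; returns the final UTF-8 bytes."""
    utf8 = text.encode("utf-8")
    utf16 = utf8.decode("utf-8").encode("utf-16-le")
    utf32 = utf16.decode("utf-16-le").encode("utf-32-le")
    return utf32.decode("utf-32-le").encode("utf-8")

# Invented sample: an ASCII letter, a BMP letter, a supplementary-plane character.
sample = "A\u0142\U0001D11E"
round_trip_ok = reencode_through_all_utfs(sample) == sample.encode("utf-8")

# A lone surrogate is not a valid character, so the strict encoder refuses it.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_ok = True
except UnicodeEncodeError:
    lone_surrogate_ok = False

print(round_trip_ok)      # True
print(lone_surrogate_ok)  # False
```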
Title: RE: Roundtripping in Unicode
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED]] On Behalf Of Peter Kirk
> Sent: Tuesday, December 14, 2004 11:32 AM
> This is a design flaw in Unix, or in how it is explained to
> users. Well, Lars wrote "Basically, you
"Arcane Jill" <[EMAIL PROTECTED]> writes:
> If so, Marcin, what exactly is the error, and whose fault is it?
It's an error to use locales with different encodings on the same
system.
--
__("< Marcin Kowalczyk
\__/ [EMAIL PROTECTED]
^^ http://qrnik.knm.org.pl/~qrczak/
Lars said:
> According to UTC, you need to keep processing
> the UNIX filenames as BINARY data. And, also according to UTC, any UTF-8
> function is allowed to reject invalid sequences. Basically, you are not
> supposed to use strcpy to process filenames.
This is a very misleading set of statement
On 14/12/2004 11:32, Arcane Jill wrote:
I've been following this thread for a while, and I've pretty much got
the hang of the issues here. To summarize:
I haven't followed everything, but here is my 2 cents worth.
I note that there is a real problem. I have had significant problems in
Windows wi
Lars Kristan <[EMAIL PROTECTED]> writes:
> Hm, here lies the catch. According to UTC, you need to keep
> processing the UNIX filenames as BINARY data. And, also according
> to UTC, any UTF-8 function is allowed to reject invalid sequences.
> Basically, you are not supposed to use strcpy to pro
"Arcane Jill" <[EMAIL PROTECTED]> writes:
> OBSERVATION - Roundtripping is possible in the direction NOT-UTF-8 ->
> NOT-UTF-16 -> NOT-UTF-8
But it's not possible in the direction NOT-UTF-16 -> NOT-UTF-8 ->
NOT-UTF-16, unless you define valid sequences of NOT-UTF-16 in an
awkward way which would h
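
The asymmetry between the two directions can be reproduced with Python's built-in `surrogateescape` error handler. It is used here only as a convenient stand-in for an escaping scheme of this kind, not as the scheme proposed in the thread, and the helper names are hypothetical. Invalid bytes survive the trip bytes -> text -> bytes, but a string made only of escape code points can collapse into an ordinary character on the trip text -> bytes -> text.

```python
def bytes_to_text(data: bytes) -> str:
    # Undecodable bytes 0x80..0xFF become the code points U+DC80..U+DCFF.
    return data.decode("utf-8", errors="surrogateescape")

def text_to_bytes(text: str) -> bytes:
    # U+DC80..U+DCFF are turned back into the single bytes 0x80..0xFF.
    return text.encode("utf-8", errors="surrogateescape")

# Direction 1: arbitrary bytes survive the round trip.
raw = b"caf\xe9"
bytes_round_trip_ok = text_to_bytes(bytes_to_text(raw)) == raw

# Direction 2: two escape code points whose bytes happen to form valid UTF-8.
escaped = "\udcc3\udca9"
collapsed = bytes_to_text(text_to_bytes(escaped))

print(bytes_round_trip_ok)   # True
print(collapsed == escaped)  # False
print(collapsed == "\u00e9") # True
```

The second case fails because the bytes 0xC3 0xA9 are themselves a well-formed UTF-8 sequence, so the decoder has no reason to escape them again.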
On 14/12/2004 17:47, John Cowan wrote:
Peter Kirk scripsit:
I think the problem here is that a Unix filename is a string of octets,
not of characters. And so it should not be converted into another
encoding form as if it is characters; it should be processed at a quite
different level of inte
Peter Kirk scripsit:
> I think the problem here is that a Unix filename is a string of octets,
> not of characters. And so it should not be converted into another
> encoding form as if it is characters; it should be processed at a quite
> different level of interpretation.
Unfortunately, that
> Is that right, Lars?
>
> If so, Marcin, what exactly is the error, and whose fault is it?
>
> Jill
>
> -Original Message-
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Behalf Of Marcin 'Qrczak' Kowalczyk
> Sent: 13 December 2004 14:59
> To: [EMAIL PROTECTED]
> Subject: Re: Roundtripping in Unicode
>
> Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
Title: RE: Roundtripping in Unicode
Arcane Jill wrote:
> I've been following this thread for a while, and I've pretty
Thanks for bearing with me. And I hope my response will not discourage you from continuing to do so. That is, until I am banned from the list for heresy.
I've been following this thread for a while, and I've pretty much got the
hang of the issues here. To summarize:
Unix filenames consist of an arbitrary sequence of octets, excluding 0x00
and 0x2F. How they are /displayed/ to any given user depends on that user's
locale setting. In this scenario
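
The two points of this summary can be sketched in a few lines of Python. The function names and the sample filename are invented for illustration, and the validity check implements only the rule as stated here (no 0x00, no 0x2F), nothing more.

```python
def is_valid_unix_name(name: bytes) -> bool:
    """Rule from the summary: any octets except 0x00 (NUL) and 0x2F ('/')."""
    return b"\x00" not in name and b"/" not in name

def display(name: bytes, encoding: str) -> str:
    # What a user sees depends on the encoding of that user's locale;
    # octets that cannot be decoded are shown as U+FFFD.
    return name.decode(encoding, errors="replace")

# Hypothetical filename created under a Latin-1 locale.
name = b"caf\xe9"

print(is_valid_unix_name(name))           # True
print(is_valid_unix_name(b"a/b"))         # False
print(display(name, "latin-1"))           # café
print(display(name, "utf-8") == "caf\ufffd")  # True
```

The same octets are a perfectly legal filename in both cases. Only the displayed form changes with the locale.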
Philippe VERDY wrote:
> (In fact I also think that mapping invalid sequences to U+FFFD is also
> an error, because U+FFFD is valid, and the presence of the encoding
> error in the source is lost, and will not throw exceptions in further
> processings of the remapped text, unless the application c
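
The loss of information described here is easy to demonstrate. In the following Python sketch (sample bytes invented), two different invalid inputs become the same string once the bad byte is replaced by U+FFFD, and the result then encodes without any error being raised.

```python
# Two different byte strings, each ending in a byte that is invalid in UTF-8.
first = b"file\xe9"
second = b"file\xfc"

# errors="replace" substitutes U+FFFD for the undecodable byte.
first_text = first.decode("utf-8", errors="replace")
second_text = second.decode("utf-8", errors="replace")

# Both inputs now look identical, so the original bytes cannot be recovered.
same_text = first_text == second_text == "file\ufffd"

# The remapped text is valid, so later processing raises no exception.
reencoded = first_text.encode("utf-8")

print(same_text)           # True
print(reencoded == first)  # False
```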
nvalid sequences of 8-bit or 32-bit code units.
- Original Message -
From: "Mark Davis" <[EMAIL PROTECTED]>
To: "Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Monday, December 13, 2004 11:04 PM
Subject: Re
Title: RE: Roundtripping in Unicode
Peter Kirk wrote:
> Now no doubt many Unix filename handling utilities ignore the
> fact that
> some octets are invalid or uninterpretable in the locale,
> because they
> handle filenames as octet strings (with 0x00 and 0x2F
Doug Ewell scripsit:
> "When faced with [an] ill-formed code unit sequence while transforming
> or interpreting text, a conformant process must treat the first code
> unit... as an illegally terminated code unit sequence -- for example, by
> signaling an error, filtering the code unit out, or repr
' Kowalczyk
Sent: 13 December 2004 14:59
To: [EMAIL PROTECTED]
Subject: Re: Roundtripping in Unicode
Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
ferent than private use (where they also have to agree on the
interpretation).
-Mark
- Original Message -
From: "Kenneth Whistler" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Monday, December 13, 2004 13:04
Subject: RE: Roundtr
Title: RE: Roundtripping in Unicode
> Ken is absolutely right. It would be theoretically possible
> to add 128 code
> points that would allow one to roundtrip a bytestream after
> passing through
> a UTF-8 <=> UTF-32 conversion. (For that matter, it would be
> pos
Lars Kristan stated:
> I said, the choice is yours. My proposal does not prevent you from doing it
> your way. You don't need to change anything and it will still work the way
> it worked before. OK? I just want 128 codepoints so I can make my own
> choice.
You have them: U+EE80..U+EEFF, which a
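
As a rough illustration of using that range, the sketch below maps each undecodable byte 0x80..0xFF to U+EE80..U+EEFF (code points inside the Private Use Area) and back. Everything about it is an assumption made for the example: the handler name `pua_escape_sketch`, the function names, and the byte-at-a-time escaping are not taken from the thread. It also has a known limitation: input that already contains valid UTF-8 for U+EE80..U+EEFF is not preserved, because such characters are written back as single bytes.

```python
import codecs

PUA_BASE = 0xEE00  # byte 0x80..0xFF maps to U+EE80..U+EEFF

def _escape_to_pua(exc):
    # Called by the decoder at each invalid position; escape one byte and resume.
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad_byte = exc.object[exc.start]  # always >= 0x80 for a UTF-8 decoding error
    return chr(PUA_BASE + bad_byte), exc.start + 1

codecs.register_error("pua_escape_sketch", _escape_to_pua)

def decode_with_pua(data: bytes) -> str:
    return data.decode("utf-8", errors="pua_escape_sketch")

def encode_with_pua(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - PUA_BASE)   # escape code point -> original byte
        else:
            out += ch.encode("utf-8")   # ordinary character -> normal UTF-8
    return bytes(out)

raw = b"caf\xe9.txt"                    # invented Latin-1 filename
text = decode_with_pua(raw)

print(text == "caf\ueee9.txt")          # True
print(encode_with_pua(text) == raw)     # True
```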
Title: RE: Roundtripping in Unicode
Kenneth Whistler wrote:
> Lars Kristan stated:
>
> > I said, the choice is yours. My proposal does not prevent
> you from doing it
> > your way. You don't need to change anything and it will
> still work the way
> > it
Title: RE: RE: RE: Roundtripping in Unicode
Philippe VERDY wrote:
> I don't think I miss the point. My suggested approach to
> perform roundtrip conversions between UTF's while keeping all
> invalid sequences as invalid (for the standard UTFs), is much
> less risky t
Lars Kristan wrote:
> What I was talking about in the paragraph in question is what happens if you want to take unassigned codepoints and give them a new status.
You don't need to do that. No Unicode application must assign semantics to unassigned codepoints.
If a source sequence is invalid, and you
> From : "Lars Kristan"
> Philippe VERDY wrote:
> > If a source sequence is invalid, and you want to preserve it,
> > then this sequence must remain invalid if you change its encoding.
> > So there's no need for Unicode to assign valid code points
> > for invalid source data.
> Using invalid
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> UTF-8 is painful to process in the first place. You are making it
> even harder by demanding that all functions which process UTF-8 do
> something sensible for bytes which don't form valid
Lars Kristan <[EMAIL PROTECTED]> writes:
> And once we understand that things are manageable and not as
> frightening as it seems at first, then we can stop using this as an
> argument against introducing 128 codepoints. People who will find
> them useful should and will bother with the consequence
Title: RE: RE: Roundtripping in Unicode
Philippe VERDY wrote:
> If a source sequence is invalid, and you want to preserve it,
> then this sequence must remain invalid if you change its encoding.
> So there's no need for Unicode to assign valid code points
> for invalid s
Title: RE: Roundtripping in Unicode
Philippe Verdy wrote:
> An implementation that uses UTF-8 for valid string could use
> the invalid
> ranges for lead bytes to encapsulate invalid byte values.
> Note however that
> invalid bytes you would need to represent have 256 pos
Lars Kristan <[EMAIL PROTECTED]> writes:
> But, as I once already said, you can do it with UTF-8, you simply
> keep the invalid sequences as they are, and really handle them
> differently only when you actually process them or display them.
UTF-8 is painful to process in the first place. You are
Title: RE: Roundtripping in Unicode
Philippe Verdy wrote:
> From: "Doug Ewell" <[EMAIL PROTECTED]>
> > Lars Kristan wrote:
> >> I am sure one of the standardizers will find a Unicodally
> >> correct way of putting it.
> >
> > I can
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> You are trying to stick with processing byte sequences, carefully
> preserving the storage format instead of preserving the meaning in
> terms of Unicode characters. This leads to less robust soft
Lars Kristan <[EMAIL PROTECTED]> writes:
>> Please make up your mind: either they are valid and programs are
>> required to accept them, or they are invalid and programs are required
>> to reject them.
>
> I don't know what they should be called. The fact is there shouldn't be any.
> And that cur
RE: Roundtripping in Unicode
My view about this "problem" of roundtripping is
that if data, supposed to contain only valid UTF-8 sequences, contains some
invalid byte sequences that still need to be roundtripped to some "code
point" for internal management that can be roundtripped later to the
o
From: "Doug Ewell" <[EMAIL PROTECTED]>
Lars Kristan wrote:
I am sure one of the standardizers will find a Unicodally
correct way of putting it.
I can't even understand that paragraph, let alone paraphrase it.
My understanding of his question and my response to his problem is that you
MUST not use V
RE: Roundtripping in Unicode
Lars Kristan wrote:
>>> All assigned codepoints do roundtrip even in my concept.
>>> But unassigned codepoints are not valid data.
>>
>> Please make up your mind: either they are valid and programs are
>> required to accept them,
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> Lars Kristan <[EMAIL PROTECTED]> writes:
>
> > All assigned codepoints do roundtrip even in my concept.
> > But unassigned codepoints are not valid data.
>
> Please make up your
Lars Kristan <[EMAIL PROTECTED]> writes:
> All assigned codepoints do roundtrip even in my concept.
> But unassigned codepoints are not valid data.
Please make up your mind: either they are valid and programs are
required to accept them, or they are invalid and programs are required
to reject the
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> > Roundtrip for valid data is of course essential and needs to be
> > preserved.
>
> Your proposal does not do this.
All assigned codepoints do roundtrip even in my concept. But unassigned c
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> Lars Kristan <[EMAIL PROTECTED]> writes:
>
> > The other name for this is roundtripping. Currently, Unicode allows
> > a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are
Lars Kristan <[EMAIL PROTECTED]> writes:
>> It's essential that any UTF-n can be translated to any other without
>> loss of data. Because it allows to use an implementation of the given
>> functionality which represents data in any form, not necessarily the
>> form we have at hand, as long as corr
Lars Kristan <[EMAIL PROTECTED]> writes:
> The other name for this is roundtripping. Currently, Unicode allows
> a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are
> several reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more
> valuable, even if it means that the other roundtrip i