Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
My view about this "problem" of roundtripping is that if data, supposed to contain only valid UTF-8 sequences, contains some invalid byte sequences that still need to be roundtripped to some "code point" for internal management that can be roundtripped later to the o

RE: Nicest UTF

2004-12-11 Thread D. Starner
"Lars Kristan" writes: > > A system administrator (because he has access to all files). > My my, you are assuming all files are in the same encoding. And what about > all the references to the files in scripts? In configuration files? Soft > links? If you want to break things, this is definitely

Re: Nicest UTF

2004-12-11 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" writes: > But demanding that each program which searches strings checks for > combining classes is I'm afraid too much. How is it any different from a case-insensitive search? > >> Does "\n" followed by a combining code point start a new line? > > > > The Standard

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Philippe Verdy
From: "Michael Everson" <[EMAIL PROTECTED]> Nonsense. You might as well try to explain SPQR on the same basis. I won't. I know that SPQR was used on architectural constructions as a symbol of the Roman Empire, and it was a well-known acronym of a Latin expression. It largely predates the inventio

Re: infinite combinations, was Re: Nicest UTF

2004-12-11 Thread Philippe Verdy
From: "Peter R. Mueller-Roemer" <[EMAIL PROTECTED]> For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertoire is still finite. I do think that you are underestimating the repertoire. Also Unicode does NOT defin

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Michael Everson
At 01:12 +0100 2004-12-12, Philippe Verdy wrote: I would not be surprised if this acronym was defined in some internationally accepted set of abbreviations used by telegraphists, so that their clients became exposed to these acronyms when reading telegrams received from their local post office t

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Philippe Verdy
From: "Séamas Ó Brógáin" <[EMAIL PROTECTED]> John wrote: As far as I know, they were first used in formal invitations (to weddings, funerals, dances, etc.) in the corner of the card, as both shorter and more fancy than the older phrase "The favor of your reply is requested". This is correct. The p

Re: Roundtripping in Unicode

2004-12-11 Thread Philippe Verdy
From: "Doug Ewell" <[EMAIL PROTECTED]> Lars Kristan wrote: I am sure one of the standardizers will find a Unicodally correct way of putting it. I can't even understand that paragraph, let alone paraphrase it. My understanding of his question and my response to his problem is that you MUST not use V

RE: Nicest UTF

2004-12-11 Thread Lars Kristan
Missed this one the other day, but cannot let it go... Marcin 'Qrczak' Kowalczyk wrote: > > filenames, what is one supposed to do? Convert all > filenames to UTF-8? > > Yes. > > > Who will do that? > > A system administrator (because he has access to all files).

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Séamas Ó Brógáin
John wrote: As far as I know, they were first used in formal invitations (to weddings, funerals, dances, etc.) in the corner of the card, as both shorter and more fancy than the older phrase "The favor of your reply is requested". This is correct. The practice dates from the end of the nineteenth

Re: Roundtripping in Unicode

2004-12-11 Thread Doug Ewell
Lars Kristan wrote: >>> All assigned codepoints do roundtrip even in my concept. >>> But unassigned codepoints are not valid data. >> >> Please make up your mind: either they are valid and programs are >> required to accept them, or they are invalid and programs are >>

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Philippe Verdy wrote: > This is a known caveat even for Unix, when you look at the > tricky details of > the support of Windows file sharing through Samba, when the > client requests > a file with a "short" 8.3 name, that a partition

RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > Lars Kristan <[EMAIL PROTECTED]> writes: > > > All assigned codepoints do roundtrip even in my concept. > > But unassigned codepoints are not valid data. > > Please make up your mind: either they are valid and programs ar

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread Curtis Clark
on 2004-12-11 09:21 John Cowan wrote: It's been used as an English verb, adjective, and noun for 30-40 years and perhaps much longer: see below. Longer. I can attest from my youth in the 1950s that my parents considered it ordinary English usage, and in fact knew of its origin. -- Curtis Clark

Re: Nicest UTF

2004-12-11 Thread Philippe Verdy
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Regarding A, I see three choices: 1. A string is a sequence of code points. 2. A string is a sequence of combining character sequences. 3. A string is a sequence of code points, but it's encouraged to process it in groups of combining characte

RE: Software support costs (was: Nicest UTF)

2004-12-11 Thread Carl W. Brown
Philippe, >>However, within the program itself UTF-8 presents a >>problem when looking for specific data in memory buffers. >>It is nasty, time consuming and error prone. Mapping >>UTF-16 to code points is a snap as long as you >>do not have a lot of surrogates. If you do then probably >>UTF-32

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > All assigned codepoints do roundtrip even in my concept. > But unassigned codepoints are not valid data. Please make up your mind: either they are valid and programs are required to accept them, or they are invalid and programs are required to reject the

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Doug Ewell
Michael Everson wrote: > In Ireland sometime in the early nineties, the Allied Irish Bank > became AIB Bank, the Allied Irish Bank Bank. Israel Discount Bank of New York regularly refers to itself as "IDB Bank." -Doug Ewell Fullerton, California http://users.adelphia.net/~dewell/

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Kenneth Whistler wrote: > Further, as it turns out that Lars is actually asking for > "standardizing" corrupt UTF-8, a notion that isn't going to > fly even two feet, I think the whole idea is going to be > a complete non-starter. Tech

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Peter Kirk
On 11/12/2004 02:29, Mark Davis wrote: This is just a confusion among the hoi polloi. Mark But such things happen not just among the German and Swedish polloi, but even in the crowning heights of the English language. The word "cherubims" is used many times in the King James Bible and at leas

RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > > Roundtrip for valid data is of course essential and needs to be > > preserved. > > Your proposal does not do this. All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data.

Re: Please RSVP... (was: US-ASCII)

2004-12-11 Thread John Cowan
Philippe Verdy scripsit: > Didn't know that. Is this a very recent use? It's been used as an English verb, adjective, and noun for 30-40 years and perhaps much longer: see below. > In France, I think that RSVP was introduced and widely used at end of > telegraphic messages (that contained lots

RE: Roundtripping in Unicode

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > Lars Kristan <[EMAIL PROTECTED]> writes: > > > The other name for this is roundtripping. Currently, Unicode allows > > a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are > > several reasons why a UTF-8=>UTF-16(

infinite combinations, was Re: Nicest UTF

2004-12-11 Thread Peter R. Mueller-Roemer
Philippe Verdy wrote: The repertoire of all possible combining characters sequences is already infinite in Unicode, as well as the number of "default grapheme clusters" they can represent. For a fixed length of combining character sequ

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: >> It's essential that any UTF-n can be translated to any other without >> loss of data. Because it allows to use an implementation of the given >> functionality which represents data in any form, not necessarily the >> form we have at hand, as long as corr

Re: Roundtripping in Unicode

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > The other name for this is roundtripping. Currently, Unicode allows > a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are > several reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more > valuable, even if it means that the other roundtrip i

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Kenneth Whistler wrote: > Lars responded: > > > > ... Whatever the solutions > > > for representation of corrupt data bytes or uninterpreted data > > > bytes on conversion to Unicode may be, that is irrelevant to the > > > concerns on

Roundtripping in Unicode (was RE: Invalid UTF-8 sequences)

2004-12-11 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > Lars Kristan <[EMAIL PROTECTED]> writes: > > > Quite close. Except for the fact that: > > * U+EE93 is represented in UTF-32 as 0xEE93 > > * U+EE93 is represented in UTF-16 as 0xEE93 > > *

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Johannes Bergerhausen
On 11.12.2004 at 04:32, Clark Cox wrote: There are always the classics: "ATM Machine" and "PIN Number" Here in Germany, they say "ASCII-Code". :-) Johannes

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
John Cowan wrote: > However, although they are *technically* octet sequences, they > are *functionally* character strings. That's the issue. Nicely put! But UTC does not seem to care. > > > The point I'm making is that *whatever* you

Re: US-ASCII (was: Re: Invalid UTF-8 sequences)

2004-12-11 Thread Michael Everson
At 17:38 -0800 2004-12-10, Asmus Freytag wrote: Other examples of apparent redundancy, are Cakes -> Keks (German), plural Kekse Baby -> bebis (Swedish), plural bebissar and there are many more such examples. In Ireland sometime in the early nineties, the Allied Irish Bank became AIB Bank, the Alli

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Arcane Jill responded: > >> Windows filesystems do know what encoding they use. > >Err, not really. MS-DOS *need to know* the encoding to use, > >a bit like a > >*nix application that displays filenames need to know the > >encoding to

Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: >> > This implies that every programmer needs an indepth knowledge of >> > Unicode to handle simple strings. >> >> There is no way to avoid that. > > Then there's no way that we're ever going to get reliable Unicode > support. This is probably true. I

RE: When to validate?

2004-12-11 Thread Lars Kristan
Antoine Leca wrote: > As a result, your strings are likely to be some structures. > Then, it is pretty easy to add some s_valid flag, and you are done. Is that a proven technique? I'd say not. The flag would only be valid for as long as the string is not changed. Yo

Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: [...] > This was later amended in an errata for XML 1.0 which now says that > the list of code points whose use is *discouraged* (but explicitly > *not* forbidden) for the "Char" production is now: [...] Ugh, it's a mess... IMHO Unicode is partially t

RE: When to validate?

2004-12-11 Thread Lars Kristan
Andy Heninger wrote: > > Some important things in designing a function API are > > o Fully define what the behavior is. With a function like > tolower(), you could leave malformed sequences unaltered; > you could replace them with some substitution c