RE: Roundtripping in Unicode
My view about this "problem" of roundtripping is
that if data, supposed to contain only valid UTF-8 sequences, contains some
invalid byte sequences that still need to be roundtripped to some "code
point" for internal management that can be roundtripped later to the
o
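To make the idea of mapping invalid bytes to some code point and back concrete, here is a minimal sketch using Python's `surrogateescape` error handler (PEP 383). This is Python's own mechanism, not the scheme discussed in this thread: it maps each undecodable byte to a lone surrogate in the range U+DC80..U+DCFF and maps it back on encoding. The sample bytes are made up for illustration.

```python
# Hypothetical sample: valid UTF-8 ("caf" + U+00E9) followed by a stray 0xFF byte.
raw = b"caf\xc3\xa9\xff"

# Decode: valid sequences become ordinary code points; the invalid byte 0xFF
# becomes the lone surrogate U+DCFF instead of raising an error.
text = raw.decode("utf-8", errors="surrogateescape")

# Encode with the same handler to turn the escaped surrogate back into 0xFF.
restored = text.encode("utf-8", errors="surrogateescape")

print(ascii(text))
print(restored == raw)
```

Note that the intermediate string is not valid Unicode text in the strict sense (it contains a lone surrogate), so it only roundtrips through code that uses the same handler.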
"Lars Kristan" writes:
> > A system administrator (because he has access to all files).
> My my, you are assuming all files are in the same encoding. And what about
> all the references to the files in scripts? In configuration files? Soft
> links? If you want to break things, this is definitely
"Marcin 'Qrczak' Kowalczyk" writes:
> But demanding that each program which searches strings checks for
> combining classes is I'm afraid too much.
How is it any different from a case-insensitive search?
> >> Does "\n" followed by a combining code point start a new line?
> >
> > The Standard
From: "Michael Everson" <[EMAIL PROTECTED]>
Nonsense. You might as well try to explain SPQR on the same basis.
I won't. I know that SPQR was used on architectural constructions as a
symbol of the Roman Empire, and it was a well-known acronym of a Latin
expression.
It largely predates the inventio
From: "Peter R. Mueller-Roemer" <[EMAIL PROTECTED]>
For a fixed length of combining character sequence (base + 3 combining
marks is the most I have seen graphically distinguishable) the repertoire
is still finite.
I do think that you are underestimating the repertoire. Also Unicode does
NOT defin
At 01:12 +0100 2004-12-12, Philippe Verdy wrote:
I would not be surprised if this acronym was defined in some
internationally accepted set of abbreviations used by telegraphists,
so that their clients became exposed to these acronyms when reading
telegrams received from their local post office t
From: "Séamas Ó Brógáin" <[EMAIL PROTECTED]>
John wrote:
As far as I know, they were first used in formal invitations (to
weddings, funerals, dances, etc.) in the corner of the card, as both
shorter and more fancy than the older phrase "The favor of your reply
is requested".
This is correct. The p
From: "Doug Ewell" <[EMAIL PROTECTED]>
Lars Kristan wrote:
I am sure one of the standardizers will find a Unicodally
correct way of putting it.
I can't even understand that paragraph, let alone paraphrase it.
My understanding of his question and my response to his problem is that you
MUST not use V
Title: RE: Nicest UTF
Missed this one the other day, but cannot let it go...
Marcin 'Qrczak' Kowalczyk wrote:
> > filenames, what is one supposed to do? Convert all
> filenames to UTF-8?
>
> Yes.
>
> > Who will do that?
>
> A system administrator (because he has access to all files).
John wrote:
As far as I know, they were first used in formal invitations (to
weddings, funerals, dances, etc.) in the corner of the card, as both
shorter and more fancy than the older phrase "The favor of your reply
is requested".
This is correct. The practice dates from the end of the nineteenth
RE: Roundtripping in Unicode
Lars Kristan wrote:
>>> All assigned codepoints do roundtrip even in my concept.
>>> But unassigned codepoints are not valid data.
>>
>> Please make up your mind: either they are valid and programs are
>> required to accept them, or they are invalid and programs are
>>
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Philippe Verdy wrote:
> This is a known caveat even for Unix, when you look at the
> tricky details of
> the support of Windows file sharing through Samba, when the
> client requests
> a file with a "short" 8.3 name, that a partition
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> Lars Kristan <[EMAIL PROTECTED]> writes:
>
> > All assigned codepoints do roundtrip even in my concept.
> > But unassigned codepoints are not valid data.
>
> Please make up your mind: either they are valid and programs ar
on 2004-12-11 09:21 John Cowan wrote:
It's been used as an English verb, adjective, and noun for 30-40 years
and perhaps much longer: see below.
Longer. I can attest from my youth in the 1950s that my parents
considered it ordinary English usage, and in fact knew of its origin.
--
Curtis Clark
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]>
Regarding A, I see three choices:
1. A string is a sequence of code points.
2. A string is a sequence of combining character sequences.
3. A string is a sequence of code points, but it's encouraged
to process it in groups of combining characte
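The difference between choices 1 and 2 can be illustrated with a small sketch that groups a sequence of code points into base-plus-marks clusters. It is a simplification, not the Standard's full definition: the helper name `combining_sequences` is hypothetical, a "combining mark" is taken to be any code point whose General_Category starts with M, and a mark is attached to whatever precedes it (including a newline).

```python
import unicodedata


def combining_sequences(text):
    """Group code points so each combining mark joins the preceding group."""
    groups = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and groups:
            groups[-1] += ch      # attach the mark to the previous group
        else:
            groups.append(ch)     # start a new group (base or leading mark)
    return groups


sample = "e\u0301a"               # "e" + COMBINING ACUTE ACCENT + "a"
print(len(sample))                # number of code points
print(len(combining_sequences(sample)))  # number of grouped sequences
```

Under this simplification a newline followed by a combining mark ends up in a single group, which is exactly the kind of case the thread is asking about.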
Philippe,
>>However, within the program itself UTF-8 presents a
>>problem when looking for specific data in memory buffers.
>>It is nasty, time consuming and error prone. Mapping
>>UTF-16 to code points is a snap as long as you
>>do not have a lot of surrogates. If you do then probably
>>UTF-32
Lars Kristan <[EMAIL PROTECTED]> writes:
> All assigned codepoints do roundtrip even in my concept.
> But unassigned codepoints are not valid data.
Please make up your mind: either they are valid and programs are
required to accept them, or they are invalid and programs are required
to reject the
Michael Everson wrote:
> In Ireland sometime in the early nineties, the Allied Irish Bank
> became AIB Bank, the Allied Irish Bank Bank.
Israel Discount Bank of New York regularly refers to itself as "IDB
Bank."
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
> Further, as it turns out that Lars is actually asking for
> "standardizing" corrupt UTF-8, a notion that isn't going to
> fly even two feet, I think the whole idea is going to be
> a complete non-starter.
Tech
On 11/12/2004 02:29, Mark Davis wrote:
This is just a confusion among the hoi polloi.
-Mark
But such things happen not just among the German and Swedish polloi, but
even in the crowning heights of the English language. The word
"cherubims" is used many times in the King James Bible and at leas
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> > Roundtrip for valid data is of course essential and needs to be
> > preserved.
>
> Your proposal does not do this.
All assigned codepoints do roundtrip even in my concept. But unassigned codepoints are not valid data.
Philippe Verdy scripsit:
> Didn't know that. Is this a very recent use?
It's been used as an English verb, adjective, and noun for 30-40 years
and perhaps much longer: see below.
> In France, I think that RSVP was introduced and widely used at end of
> telegraphic messages (that contained lots
Title: RE: Roundtripping in Unicode
Marcin 'Qrczak' Kowalczyk wrote:
> Lars Kristan <[EMAIL PROTECTED]> writes:
>
> > The other name for this is roundtripping. Currently, Unicode allows
> > a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are
> > several reasons why a UTF-8=>UTF-16(
Philippe Verdy wrote:
The repertoire of all possible combining characters sequences is
already infinite in Unicode, as well as the number of "default
grapheme clusters" they can represent.
For a fixed length of combining character sequ
Lars Kristan <[EMAIL PROTECTED]> writes:
>> It's essential that any UTF-n can be translated to any other without
>> loss of data. Because it allows to use an implementation of the given
>> functionality which represents data in any form, not necessarily the
>> form we have at hand, as long as corr
Lars Kristan <[EMAIL PROTECTED]> writes:
> The other name for this is roundtripping. Currently, Unicode allows
> a roundtrip UTF-16=>UTF-8=>UTF-16. For any data. But there are
> several reasons why a UTF-8=>UTF-16(32)=>UTF-8 roundtrip is more
> valuable, even if it means that the other roundtrip i
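As a rough illustration of the two roundtrips being compared, the following sketch uses Python's standard strict codecs. The sample strings are made up; the point is only that well-formed text survives UTF-16 => UTF-8 => UTF-16, while a byte string that is not valid UTF-8 cannot make the UTF-8 => UTF-16 => UTF-8 trip unchanged under strict or replacement decoding.

```python
# Well-formed text: ASCII, Latin-1 range, BMP, and a supplementary character.
text = "a\u00e9\u20ac\U0001d11e"
utf16 = text.encode("utf-16-le")

# UTF-16 => UTF-8 => UTF-16 preserves well-formed data exactly.
utf8 = utf16.decode("utf-16-le").encode("utf-8")
back = utf8.decode("utf-8").encode("utf-16-le")

# A byte string containing 0xFF is not valid UTF-8.
bad = b"abc\xff"
try:
    bad.decode("utf-8")           # strict decoding rejects the data
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Replacement decoding accepts it, but the original byte is lost:
# 0xFF becomes U+FFFD, which encodes as EF BF BD.
lossy = bad.decode("utf-8", errors="replace").encode("utf-8")

print(back == utf16, strict_ok, lossy == bad)
```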
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Kenneth Whistler wrote:
> Lars responded:
>
> > > ... Whatever the solutions
> > > for representation of corrupt data bytes or uninterpreted data
> > > bytes on conversion to Unicode may be, that is irrelevant to the
> > > concerns on
Title: Roundtripping in Unicode (was RE: Invalid UTF-8 sequences)
Marcin 'Qrczak' Kowalczyk wrote:
> Lars Kristan <[EMAIL PROTECTED]> writes:
>
> > Quite close. Except for the fact that:
> > * U+EE93 is represented in UTF-32 as 0xEE93
> > * U+EE93 is represented in UTF-16 as 0xEE93
> > *
Am 11.12.2004 um 04:32 schrieb Clark Cox:
There are always the classics: "ATM Machine" and "PIN Number"
Here in Germany, they say "ASCII-Code". :-)
Johannes
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
John Cowan wrote:
> However, although they are *technically* octet sequences, they
> are *functionally* character strings. That's the issue.
Nicely put! But UTC does not seem to care.
>
> > The point I'm making is that *whatever* you
At 17:38 -0800 2004-12-10, Asmus Freytag wrote:
Other examples of apparent redundancy, are
Cakes -> Keks (German), plural Kekse
Baby -> bebis (Swedish), plural bebissar
and there are many more such examples.
In Ireland sometime in the early nineties, the Allied Irish Bank
became AIB Bank, the Alli
Title: RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)
Arcane Jill responded:
> >> Windows filesystems do know what encoding they use.
> >Err, not really. MS-DOS *need to know* the encoding to use,
> >a bit like a
> >*nix application that displays filenames need to know the
> >encoding to
"D. Starner" <[EMAIL PROTECTED]> writes:
>> > This implies that every programmer needs an in-depth knowledge of
>> > Unicode to handle simple strings.
>>
>> There is no way to avoid that.
>
> Then there's no way that we're ever going to get reliable Unicode
> support.
This is probably true.
I
Title: RE: When to validate?
Antoine Leca wrote:
> As a result, your strings are likely to be some structures.
> Then, it is pretty easy to add some s_valid flag, and you are done.
Is that a proven technique? I'd say not. The flag would only be valid for as long as the string is not changed. Yo
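The objection about the flag going stale can be sketched as follows. This is only one hypothetical way to structure it, not anything proposed in the thread: the class name `CheckedBuffer` and its methods are made up, and the cached flag is simply discarded on every mutation so that it is recomputed on demand.

```python
class CheckedBuffer:
    """Byte buffer with a cached 'is valid UTF-8' flag (hypothetical design)."""

    def __init__(self, data=b""):
        self._data = bytearray(data)
        self._valid = None        # None means "not checked since last change"

    def append(self, more):
        self._data += more
        self._valid = None        # any change invalidates the cached flag

    def is_valid(self):
        if self._valid is None:   # recompute only when the cache is stale
            try:
                self._data.decode("utf-8")
                self._valid = True
            except UnicodeDecodeError:
                self._valid = False
        return self._valid


buf = CheckedBuffer(b"abc")
first = buf.is_valid()            # checked once, then cached
buf.append(b"\xff")               # mutation makes the old answer worthless
second = buf.is_valid()
print(first, second)
```

The cost is that every mutating operation has to remember to reset the flag, which is the maintenance burden the reply is pointing at.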
"Philippe Verdy" <[EMAIL PROTECTED]> writes:
[...]
> This was later amended in an erratum for XML 1.0 which now says that
> the list of code points whose use is *discouraged* (but explicitly
> *not* forbidden) for the "Char" production is now:
[...]
Ugh, it's a mess...
IMHO Unicode is partially t
Title: RE: When to validate?
Andy Heninger wrote:
>
> Some important things in designing a function API are
>
> o Fully define what the behavior is. With a function like
> tolower(), you could leave malformed sequences unaltered;
> you could replace them with some substitution c