RE: Nicest UTF

2004-12-14 Thread Lars Kristan
D. Starner wrote: > > Some won't convert any and will just start using UTF-8 > > for new ones. And this should be allowed. > > Why should it be allowed? You can't mix items with > different unlabeled encodings willy-nilly. All you'r

Re: Nicest UTF

2004-12-13 Thread Philippe Verdy
From: "D. Starner" <[EMAIL PROTECTED]> Some won't convert any and will just start using UTF-8 for new ones. And this should be allowed. Why should it be allowed? You can't mix items with different unlabeled encodings willy-nilly. All you're going to get, all you can expect to get is a mess. When yo

RE: Nicest UTF

2004-12-13 Thread D. Starner
> Some won't convert any and will just start using UTF-8 > for new ones. And this should be allowed. Why should it be allowed? You can't mix items with different unlabeled encodings willy-nilly. All you're going to get, all you can expect to get is a mess. -- ___

Re: Nicest UTF

2004-12-13 Thread John Cowan
Lars Kristan scripsit: > > I'm using ISO-8859-2. > In fact you're lucky. Many ISO-8859-1 filenames display correctly in > ISO-8859-2. Not all users are so lucky. It was a design point of ISO-8859-{1,2,3,4}, but not any other variants, that every character appears either at the same codepoint or n

RE: Nicest UTF

2004-12-13 Thread Lars Kristan
D. Starner wrote: > "Lars Kristan" writes: > > > > A system administrator (because he has access to all files). > > My my, you are assuming all files are in the same encoding. > And what about > > all the references to the fil

RE: Nicest UTF

2004-12-13 Thread Lars Kristan
Marcin 'Qrczak' Kowalczyk wrote: > > My my, you are assuming all files are in the same encoding. > > Yes. Otherwise nothing shows filenames correctly to the user. UNIX is a multi user system. One user can use one locale and might never see files

Re: infinite combinations, was Re: Nicest UTF

2004-12-12 Thread Peter Kirk
On 11/12/2004 16:53, Peter R. Mueller-Roemer wrote: ... For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertoire is still finite. In Hebrew it is actually possible to have up to 9 combining marks with a singl

Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > It's hard to create a general model that will work for all scripts > encoded in Unicode. There are too many differences. So Unicode just > appears to standardize a higher level of processing with combining > sequences and normalization forms that are

Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: >> But demanding that each program which searches strings checks for >> combining classes is I'm afraid too much. > > How is it any different from a case-insensitive search? We started from string equality, which somehow changed into searching. Default st

Re: Nicest UTF

2004-12-12 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: > My my, you are assuming all files are in the same encoding. Yes. Otherwise nothing shows filenames correctly to the user. > And what about all the references to the files in scripts? > In configuration files? Such files rarely use non-ASCII characters.

RE: Nicest UTF

2004-12-11 Thread D. Starner
"Lars Kristan" writes: > > A system administrator (because he has access to all files). > My my, you are assuming all files are in the same encoding. And what about > all the references to the files in scripts? In configuration files? Soft > links? If you want to break things, this is definitely

Re: Nicest UTF

2004-12-11 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" writes: > But demanding that each program which searches strings checks for > combining classes is I'm afraid too much. How is it any different from a case-insensitive search? > >> Does "\n" followed by a combining code point start a new line? > > > > The Standard

Re: infinite combinations, was Re: Nicest UTF

2004-12-11 Thread Philippe Verdy
From: "Peter R. Mueller-Roemer" <[EMAIL PROTECTED]> For a fixed length of combining character sequence (base + 3 combining marks is the most I have seen graphically distinguishable) the repertoire is still finite. I do think that you are underestimating the repertoire. Also Unicode does NOT defin

RE: Nicest UTF

2004-12-11 Thread Lars Kristan
Missed this one the other day, but cannot let it go... Marcin 'Qrczak' Kowalczyk wrote: > > filenames, what is one supposed to do? Convert all > filenames to UTF-8? > > Yes. > > > Who will do that? > > A system administr

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Philippe Verdy wrote: > This is a known caveat even for Unix, when you look at the > tricky details of > the support of Windows file sharing through Samba, when the > client requests > a file with a "sho

Re: Nicest UTF

2004-12-11 Thread Philippe Verdy
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Regarding A, I see three choices: 1. A string is a sequence of code points. 2. A string is a sequence of combining character sequences. 3. A string is a sequence of code points, but it's encouraged to process it in groups of combining characte

RE: Software support costs (was: Nicest UTF

2004-12-11 Thread Carl W. Brown
Philippe, >>However, within the program itself UTF-8 presents a >>problem when looking for specific data in memory buffers. >>It is nasty, time consuming and error prone. Mapping >>UTF-16 to code points is a snap as long as you >>do not have a lot of surrogates. If you do then probably >>UTF-32

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Kenneth Whistler wrote: > Further, as it turns out that Lars is actually asking for > "standardizing" corrupt UTF-8, a notion that isn't going to > fly even two feet, I think the whole idea is going to be

infinite combinations, was Re: Nicest UTF

2004-12-11 Thread Peter R. Mueller-Roemer
Philippe Verdy wrote: The repertoire of all possible combining characters sequences is already infinite in Unicode, as well as the number of "default grapheme clusters" they can represent. For a fixed length of combining character sequ

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Kenneth Whistler wrote: > Lars responded: > > > > ... Whatever the solutions > > > for representation of corrupt data bytes or uninterpreted data > > > bytes on conversion to Unicode may be, that is ir

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
John Cowan wrote: > However, although they are *technically* octet sequences, they > are *functionally* character strings. That's the issue. Nicely put! But UTC does not seem to care. > > > The point I'm

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-11 Thread Lars Kristan
Arcane Jill responded: > >> Windows filesystems do know what encoding they use. > >Err, not really. MS-DOS *need to know* the encoding to use, > >a bit like a > >*nix application that displays filenames need

Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: >> > This implies that every programmer needs an in-depth knowledge of >> > Unicode to handle simple strings. >> >> There is no way to avoid that. > > Then there's no way that we're ever going to get reliable Unicode > support. This is probably true. I

Re: Nicest UTF

2004-12-11 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: [...] > This was later amended in an errata for XML 1.0 which now says that > the list of code points whose use is *discouraged* (but explicitly > *not* forbidden) for the "Char" production is now: [...] Ugh, it's a mess... IMHO Unicode is partially t

Re: Nicest UTF

2004-12-10 Thread John Cowan
Philippe Verdy scripsit: > And I disagree with you about the fact the U+ can't be used in XML > documents. It can be used in URI through URI escaping mechanism, as > explicitly indicated in the XML specification... You have a hold of the right stick but at the wrong end. U+ can be enco

Re: Nicest UTF

2004-12-10 Thread John Cowan
Philippe Verdy scripsit: > >Okay, I'm confused. Does ≮ open a tag? Does it matter if it's > >composed or decomposed? > > It does not open a XML tag. > It does matter if it's composed (won't open a tag) or decomposed (will > open a tag, but with a combining character, invalid as an identifier >

Re: Nicest UTF

2004-12-10 Thread John Cowan
Philippe Verdy scripsit: > If you look at the XML 1.0 Second Edition The Second Edition has been superseded by the Third. > Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | > [#x10000-#x10FFFF] That is normative. > But the comment following it specifies: That comment is not n
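
The Char production under discussion can be checked mechanically. The following is a minimal Python sketch, assuming the published XML 1.0 grammar in which the last range is #x10000-#x10FFFF; the function name `is_xml_char` is made up for illustration and is not part of any XML library.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything up to the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # skips surrogates, U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


if __name__ == "__main__":
    for cp in (0x0, 0x9, 0x41, 0xD800, 0xFFFE, 0x10FFFF):
        print(f"U+{cp:04X}", is_xml_char(cp))
```

Note that the test works on code point values only; it says nothing about which code points are merely discouraged by the non-normative comment.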

Re: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: "D. Starner" <[EMAIL PROTECTED]> Okay, I'm confused. Does ≮ open a tag? Does it matter if it's composed or decomposed? It does not open a XML tag. It does matter if it's composed (won't open a tag) or decomposed (will open a tag, but with a combining character, invalid as an identifier star
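
The composed/decomposed distinction for ≮ can be seen directly with a normalizer. This is a small Python sketch using the standard `unicodedata` module; it only shows what a code-point-level scanner would encounter and does not model an actual XML parser.

```python
import unicodedata

composed = "\u226E"  # NOT LESS-THAN as a single code point
decomposed = unicodedata.normalize("NFD", composed)

# The canonical decomposition is '<' followed by COMBINING LONG SOLIDUS OVERLAY.
code_points = [f"U+{ord(c):04X}" for c in decomposed]

# A scanner looking for the code point U+003C finds it only in the decomposed form.
has_lt_composed = "<" in composed
has_lt_decomposed = "<" in decomposed

if __name__ == "__main__":
    print(code_points, has_lt_composed, has_lt_decomposed)
```

So a processor that works on code points treats the two canonically equivalent spellings differently, which is the behaviour described in the message.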

Re: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: "John Cowan" <[EMAIL PROTECTED]> Marcin 'Qrczak' Kowalczyk scripsit: http://www.w3.org/TR/2000/REC-xml-20001006#charsets implies that the appropriate level for parsing XML is code points. You are reading the XML Recommendation incorrectly. It is not defined in terms of codepoints (8-bit, 16-

Re: Nicest UTF

2004-12-10 Thread D. Starner
John Cowan writes: > You are reading the XML Recommendation incorrectly.  It is not defined > in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of > characters.  XML processors are required to process UTF-8 and UTF-16, > and may process other character encodings or not.  But the inter

Re: Nicest UTF

2004-12-10 Thread John Cowan
Marcin 'Qrczak' Kowalczyk scripsit: > http://www.w3.org/TR/2000/REC-xml-20001006#charsets > implies that the appropriate level for parsing XML is code points. You are reading the XML Recommendation incorrectly. It is not defined in terms of codepoints (8-bit, 16-bit, or 32-bit) but in terms of c

Re: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: "Philippe Verdy" <[EMAIL PROTECTED]> From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Philippe Verdy" <[EMAIL PROTECTED]> writes: The XML/HTML core syntax is defined with fixed behavior of some individual characters like '&', '<', quotation marks, and with special behavior for spaces. T

Re: Nicest UTF

2004-12-10 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" writes: > "D. Starner" writes: > > > This implies that every programmer needs an in-depth knowledge of > > Unicode to handle simple strings. > > There is no way to avoid that. Then there's no way that we're ever going to get reliable Unicode support. > If the ru

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
John Cowan <[EMAIL PROTECTED]> writes: >> > The XML/HTML core syntax is defined with fixed behavior of some >> > individual characters like '&', '<', quotation marks, and with special >> > behavior for spaces. >> >> The point is: what "characters" mean in this sentence. Code points? >> Combining

Re: Nicest UTF

2004-12-10 Thread Philippe Verdy
- Original Message - From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Friday, December 10, 2004 8:35 PM Subject: Re: Nicest UTF "Philippe Verdy" <[EMAIL PROTECTED]> writes: The XML/HTML core syntax i

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
John Cowan <[EMAIL PROTECTED]> writes: >> > The XML/HTML core syntax is defined with fixed behavior of some >> > individual characters like '&', '<', quotation marks, and with special >> > behavior for spaces. >> >> The point is: what "characters" mean in this sentence. Code points? >> Combining

Re: Software support costs (was: Nicest UTF

2004-12-10 Thread Philippe Verdy
From: "Carl W. Brown" <[EMAIL PROTECTED]> Philippe, Also a broken opening tag for HTML/XML documents In addition to not having endian problems UTF-8 is also useful when tracing intersystem communications data because XML and other tags are usually in the ASCII subset of UTF-8 and stand out making

Re: Nicest UTF

2004-12-10 Thread John Cowan
Marcin 'Qrczak' Kowalczyk scripsit: > > The XML/HTML core syntax is defined with fixed behavior of some > > individual characters like '&', '<', quotation marks, and with special > > behavior for spaces. > > The point is: what "characters" mean in this sentence. Code points? > Combining character

Re: Software support costs (was: Nicest UTF)

2004-12-10 Thread Theodore H. Smith
Philippe, Also a broken opening tag for HTML/XML documents In addition to not having endian problems UTF-8 is also useful when tracing intersystem communications data because XML and other tags are usually in the ASCII subset of UTF-8 and stand out making it easier to find the specific data you a

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > The XML/HTML core syntax is defined with fixed behavior of some > individual characters like '&', '<', quotation marks, and with special > behavior for spaces. The point is: what "characters" mean in this sentence. Code points? Combining character se

Re: Nicest UTF

2004-12-10 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: >> String equality in a programming language should not treat composed >> and decomposed forms as equal. Not this level of abstraction. > > This implies that every programmer needs an in-depth knowledge of > Unicode to handle simple strings. There is no way

Software support costs (was: Nicest UTF

2004-12-10 Thread Carl W. Brown
Philippe, > Also a broken opening tag for HTML/XML documents In addition to not having endian problems UTF-8 is also useful when tracing intersystem communications data because XML and other tags are usually in the ASCII subset of UTF-8 and stand out making it easier to find the specific data you

Re: Nicest UTF

2004-12-09 Thread Doug Ewell
Philippe Verdy wrote: >> Please start adding spaces to your entity references or >> something, because those of us reading this through a web interface >> are getting very confused. > > No confusion possible if using any classic mail reader. > > Blame your ISP (and other ISPs as well like AOL tha

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Philippe Verdy
From: "Antoine Leca" <[EMAIL PROTECTED]> Err, not really. MS-DOS *need to know* the encoding to use, a bit like a *nix application that displays filenames need to know the encoding to use the correct set of glyphs (but constraints are much more heavy.) Also Windows NT Unicode applications know it,

Re: Nicest UTF

2004-12-09 Thread Philippe Verdy
From: "D. Starner" <[EMAIL PROTECTED]> "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: If it's a broken character reference, then what about Á (769 is the code for combining acute if I'm not mistaken)? Please start adding spaces to your entity references or something, because those of us r

Re: Nicest UTF

2004-12-09 Thread Philippe Verdy
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Ok, so it's the conversion from raw text to escaped character references which should treat combining characters specially. What about < with combining acute, which doesn't have a precomposed form? A broken opening tag or a valid text character?

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Arcane Jill
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Antoine Leca Sent: 09 December 2004 11:29 To: Unicode Mailing List Subject: Re: Invalid UTF-8 sequences (was: Re: Nicest UTF) Windows filesystems do know what encoding they use. Err, not really. MS-DOS *need to

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-09 Thread Antoine Leca
On Monday, December 6th, 2004 20:52Z John Cowan wrote: > Doug Ewell scripsit: > >>> Now suppose you have a UNIX filesystem, containing filenames in a >>> legacy encoding (possibly even more than one). If one wants to >>> switch to UTF-8 filenames, what is one supposed to do? Convert all >>>

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Kenneth Whistler
Lars responded: > > ... Whatever the solutions > > for representation of corrupt data bytes or uninterpreted data > > bytes on conversion to Unicode may be, that is irrelevant to the > > concerns on whether an application is using UTF-8 or UTF-16 > > or UTF-32. > The important fact is that if you

Re: Nicest UTF

2004-12-08 Thread Kenneth Whistler
Marcin asked: > The general trouble is that numeric character references can only > encode individual code points By design. > rather than graphemes (is this a correct > term for a non-combining code point with a sequence of combining code > points?). No. The correct term is "combining characte

Re: Nicest UTF

2004-12-08 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: > If it's a broken character reference, then what about Á (769 is > the code for combining acute if I'm not mistaken)? Please start adding spaces to your entity references or something, because those of us reading this through a web interfa

Re: Nicest UTF

2004-12-08 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" writes: > String equality in a programming language should not treat composed > and decomposed forms as equal. Not this level of abstraction. This implies that every programmer needs an in-depth knowledge of Unicode to handle simple strings. The concept makes me want to

Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
John Cowan <[EMAIL PROTECTED]> writes: >> String equality in a programming language should not treat composed >> and decomposed forms as equal. Not this level of abstraction. > > Well, that assumes that there's a special "string equality" predicate, > as distinct from just having various predicate

Re: Nicest UTF

2004-12-08 Thread John Cowan
Marcin 'Qrczak' Kowalczyk scripsit: > String equality in a programming language should not treat composed > and decomposed forms as equal. Not this level of abstraction. Well, that assumes that there's a special "string equality" predicate, as distinct from just having various predicates that DWI

Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: > The semantics there are surprising, but that's true no matter what you > do. An NFC string + an NFC string may not be NFC; the resulting text > doesn't have N+M graphemes. Which implies that automatically NFC-ing strings as they are processed would be a
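
The claim that NFC is not closed under concatenation is easy to reproduce. A minimal Python sketch with the standard `unicodedata` module follows; the helper `is_nfc` is an illustrative name, defined here rather than taken from a library.

```python
import unicodedata


def is_nfc(s: str) -> bool:
    """True if s is already in Normalization Form C."""
    return unicodedata.normalize("NFC", s) == s


left = "e"         # NFC on its own
right = "\u0301"   # a lone COMBINING ACUTE ACCENT, also NFC on its own
joined = left + right

# The concatenation composes to U+00E9, so it is no longer NFC as stored,
# and it has one code point after normalization instead of two.
renormalized = unicodedata.normalize("NFC", joined)

if __name__ == "__main__":
    print(is_nfc(left), is_nfc(right), is_nfc(joined), len(renormalized))
```

A string type that silently re-normalized after every append would therefore change lengths and indices behind the caller's back.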

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread John Cowan
Kenneth Whistler scripsit: > A Sybase ASE database has the same behavior running on Windows as > running on Sun Solaris or Linux, for that matter. Fair enough. > UNIX filenames are just one instance of this. However, although they are *technically* octet sequences, they are *functionally* char

Re: Nicest UTF

2004-12-08 Thread D. Starner
"Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> writes: > "D. Starner" <[EMAIL PROTECTED]> writes: > > > You could hide combining characters, which would be extremely useful if we > > were just using Latin > > and Cyrillic scripts. > > It would need a separate API for examining the contents of a

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Kenneth Whistler
John Cowan responded: > > Storage of UNIX filenames on Windows databases, for example, ^^ O.k., I just quoted this back from the original email, but it really is a complete misconception of the issue for databases. "Windows databases" is a misn

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Lars Kristan
Kenneth Whistler wrote: > I'm going to step in here, because this argument seems to > be generating more heat than light. I agree, and I thank you for that. > First, I'm going to summarize what I think Lars Kristan

Re: Nicest UTF

2004-12-08 Thread Marcin 'Qrczak' Kowalczyk
"D. Starner" <[EMAIL PROTECTED]> writes: > You could hide combining characters, which would be extremely useful if > we were just using Latin and Cyrillic scripts. It would need a separate API for examining the contents of a combining character. You can't avoid the sequence of code points comple

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Lars Kristan
> Needless to say, these systems were badly designed at their > origin, and > newer filesystems (and OS APIs) offer much better > alternative, by either > storing explicitly on volumes which encoding it uses, or by

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-08 Thread Lars Kristan
Doug Ewell wrote: > How do file names work when the user changes from one SBCS to another > (let's ignore UTF-8 for now) where the interpretation is > different? For > example, byte C3 is U+00C3, A with tilde (Ã) in I
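
The effect of reinterpreting the same filename byte under two single-byte character sets can be shown in a few lines. This Python sketch assumes ISO-8859-1 and ISO-8859-2 as the two encodings; the second one is an assumption chosen for illustration, since the message is cut off before naming it.

```python
raw = bytes([0xC3])  # one byte taken from a hypothetical legacy filename

# Decode the identical byte under each assumed encoding.
decoded = {codec: raw.decode(codec) for codec in ("iso-8859-1", "iso-8859-2")}

if __name__ == "__main__":
    for codec, ch in decoded.items():
        print(codec, f"U+{ord(ch):04X}", ch)
```

The stored octet never changes; only the character the user sees does, which is why an unlabeled filename has no single correct conversion to Unicode.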

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Lars Kristan wrote: > I never said it doesn't violate any existing rules. Stating that it > does, doesn't help a bit. Rules can be changed. Assuming we understand > the consequences. And that is what we should be discussing

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Kenneth Whistler wrote: > I do not think this is a proposal to amend UTF-8 to allow > invalid sequences. So we should get that off the table. I hope you are right. > Apparently Lars is currently using PUA U+E080..U+E0FF > (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping > of
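
The round-tripping idea attributed to Lars can be sketched as follows. This is a hedged Python illustration, not his implementation: it assumes the U+EE80..U+EEFF variant of the mapping (the message itself is unsure which range is used), and the handler name `pua-escape` and both function names are invented here. `codecs.register_error` is the standard hook for custom decode error handling.

```python
import codecs

PUA_BASE = 0xEE00  # assumption: invalid byte 0x80..0xFF maps to U+EE80..U+EEFF


def _pua_escape(exc):
    """Decode error handler: turn each undecodable byte into a PUA code point."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua-escape", _pua_escape)


def decode_lossless(data: bytes) -> str:
    """Decode as UTF-8, keeping invalid bytes as PUA placeholders."""
    return data.decode("utf-8", errors="pua-escape")


def encode_lossless(text: str) -> bytes:
    """Inverse of decode_lossless: placeholders become the original bytes again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xEE80 <= cp <= 0xEEFF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


if __name__ == "__main__":
    name = b"caf\xe9.txt"  # a Latin-1 filename, invalid as UTF-8
    text = decode_lossless(name)
    print(ascii(text), encode_lossless(text) == name)
```

The sketch also shows the weakness raised in the thread: a filename that legitimately contains a character in U+EE80..U+EEFF, validly encoded, would not survive the round trip, because the encoder cannot tell it apart from a placeholder.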

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Philippe Verdy wrote: > An alternative can then be a mixed encoding selection: > - choose a legacy encoding that will most often be able to represent > valid filenames without loss of information (for example ISO-8859-1, > or Cp1252). > - encode the filename with it. > - try to decode it with a *

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread John Cowan
Kenneth Whistler scripsit: > Storage of UNIX filenames on Windows databases, for example, > can be done with BINARY fields, which correctly capture the > identity of them as what they are: an unconvertible array of > byte values, not a convertible string in some particular > code page. This solut

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Kenneth Whistler
Lars, I'm going to step in here, because this argument seems to be generating more heat than light. > I never said it doesn't violate any existing rules. Stating that it does, > doesn't help a bit. Rules can be changed. > I ask you to step back and try to see the big picture. First, I'm going

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Kenneth Whistler
Philippe continued: > As if Unicode had to be bound on > architectural constraints such as the requirement of representing code units > (which are architectural for a system) only as 16-bit or 32-bit units, Yes, it does. By definition. In the standard. > ignoring the fact that technologies do

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Philippe Verdy
I know what you mean here: most Linux/Unix filesystems (as well as many legacy filesystems for Windows and MacOS...) do not track the encoding with which filenames were encoded and, depending on local user preferences when that user created that

If only MS Word was coded this well (was Re: Nicest UTF)

2004-12-07 Thread Theodore H. Smith
From: "D. Starner" <[EMAIL PROTECTED]> (Sorry for sending this twice, Marcin.) "Marcin 'Qrczak' Kowalczyk" writes: UTF-8 is poorly suitable for internal processing of strings in a modern programming language (i.e. one which doesn't already have a pile of legacy functions working on bytes, but whic

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Philippe Verdy
From: "Kenneth Whistler" <[EMAIL PROTECTED]> Yes, and pigs could fly, if they had big enough wings. Once again, this is a creative comment. As if Unicode had to be bound on architectural constraints such as the requirement of representing code units (which are architectural for a system) only as

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Rick McGowan
> Yes, and pigs could fly, if they had big enough wings. An 8-foot wingspan should do it. For picture of said flying pig see: http://www.cincinnati.com/bigpiggig/profile_091700.html http://www.cincinnati.com/bigpiggig/images/pig091700.jpg Rick

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Kenneth Whistler
Philippe stated, and I need to correct: > UTF-24 already exists as an encoding form (it is identical to UTF-32), if > you just consider that encoding forms just need to be able to represent a > valid code range within a single code unit. This is false. Unicode encoding forms exist by virtue of

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Lars Kristan
Doug Ewell replied: > Actually the Unicode Technical Committee. But you are > correct: it is up > to the UTC to decide whether they want to redefine UTF-8 to permit > invalid sequences, which are to be interprete

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Lars Kristan
Doug Ewell wrote: > John Cowan wrote: > > > Windows filesystems do know what encoding they use. But a > filename on > > a Unix(oid) file system is a mere sequence of octets, of > which only 00 &g

Re: Nicest UTF

2004-12-07 Thread Philippe Verdy
From: "D. Starner" <[EMAIL PROTECTED]> If you're talking about a language that hides the structure of strings and has no problem with variable length data, then it wouldn't matter what the internal processing of the string looks like. You'd need to use iterators and discourage the use of arbitrary

Re: Nicest UTF

2004-12-06 Thread D. Starner
(Sorry for sending this twice, Marcin.) "Marcin 'Qrczak' Kowalczyk" writes: > UTF-8 is poorly suitable for internal processing of strings in a > modern programming language (i.e. one which doesn't already have a > pile of legacy functions working on bytes, but which can be designed > to make U

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-06 Thread Philippe Verdy
- Original Message - From: "Arcane Jill" <[EMAIL PROTECTED]> Probably a dumb question, but how come nobody's invented "UTF-24" yet? I just made that up, it's not an official standard, but one could easily define UTF-24 as UTF-32 with the most-significant byte (which is always zero) remo

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-06 Thread Doug Ewell
John Cowan wrote: > Windows filesystems do know what encoding they use. But a filename on > a Unix(oid) file system is a mere sequence of octets, of which only 00 > and 2F are interpreted. (Filenames containing 20, and especially 0A, > are annoying to handle with standard tools, but not illegal

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-06 Thread John Cowan
Doug Ewell scripsit: > > Now suppose you have a UNIX filesystem, containing filenames in a > > legacy encoding (possibly even more than one). If one wants to switch > > to UTF-8 filenames, what is one supposed to do? Convert all filenames > > to UTF-8? > > Well, yes. Doesn't the file system dict

Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-06 Thread Doug Ewell
Lars Kristan wrote: >> I could not disagree more with the basic premise of Lars' post. It >> is a fundamental and critical mistake to try to "extend" Unicode with >> non-standard code unit sequences to handle data that cannot be, or >> has not been, converted to Unicode from a legac

Re: Nicest UTF

2004-12-06 Thread Marcin 'Qrczak' Kowalczyk
Lars Kristan <[EMAIL PROTECTED]> writes: >> This is simply what you have to do. You cannot convert the data >> into Unicode in a way that says "I don't know how to convert this >> data into Unicode." You must either convert it properly, or leave >> the data in its original encoding (properly marke

Re: Nicest UTF

2004-12-06 Thread Andy Heninger
Asmus Freytag wrote: A simplistic model of the 'cost' for UTF-16 over UTF-32 would consider 1) 1 extra test per character (to see whether it's a surrogate) In my experience with tuning a fair amount of utf-16 software, this test takes pretty close to zero time. All modern processors have branch a
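
The "one extra test per character" in the cost model corresponds to the range check in a UTF-16 decoding loop. A minimal Python sketch follows, assuming well-formed input given as a sequence of 16-bit integers; the function name `utf16_code_points` is illustrative. Python itself will not show the branch-prediction effect described above, so the code only makes the shape of the test concrete.

```python
def utf16_code_points(units):
    """Yield code points from 16-bit code units (well-formed input assumed)."""
    it = iter(units)
    for unit in it:
        if 0xD800 <= unit <= 0xDBFF:       # the single extra test per unit
            low = next(it)                 # trailing surrogate follows
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        else:
            yield unit                     # BMP code point, the common case


if __name__ == "__main__":
    units = [0x0041, 0xD83D, 0xDE00]       # 'A' followed by U+1F600
    print([f"U+{cp:04X}" for cp in utf16_code_points(units)])
```

For text with few supplementary characters the branch is almost always not taken, which is the situation in which the measured cost is said to be close to zero.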

Re: Nicest UTF

2004-12-06 Thread Antoine Leca
for a 1-3% penalty in execution time. Of course, such a tiny penalty is easily hidden by other factors, such as the others Dr. Freytag mentioned. > Given this little model and some additional assumptions about your > own project(s), you should be able to determine the 'nicest&#

Re: Nicest UTF

2004-12-06 Thread Doug Ewell
Arcane Jill wrote: > Probably a dumb question, but how come nobody's invented "UTF-24" yet? > I just made that up, it's not an official standard, but one could > easily define UTF-24 as UTF-32 with the most-significant byte (which > is always zero) removed, hence all characters are stored in exac
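
The hypothetical "UTF-24" described in the question can be written down in a few lines. This Python sketch is an illustration only: it is not a Unicode encoding form, the big-endian byte order is an assumption made here, and the function names are invented.

```python
def utf24_encode(text: str) -> bytes:
    """Store each code point in exactly three bytes (big-endian, assumed)."""
    return b"".join(ord(ch).to_bytes(3, "big") for ch in text)


def utf24_decode(data: bytes) -> str:
    """Inverse of utf24_encode."""
    if len(data) % 3:
        raise ValueError("length must be a multiple of 3")
    return "".join(
        chr(int.from_bytes(data[i:i + 3], "big"))
        for i in range(0, len(data), 3)
    )


if __name__ == "__main__":
    encoded = utf24_encode("A\U0001F600")
    print(encoded.hex(" "), utf24_decode(encoded) == "A\U0001F600")
```

Indexing by code point is a multiplication by three, but three-byte units are not naturally aligned on common hardware, which is one practical argument raised against the idea.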

RE: Nicest UTF

2004-12-06 Thread Lars Kristan
Doug Ewell wrote: > Lars Kristan wrote: > > >> I think UTF8 would be the nicest UTF. > > > > I agree. But not for reasons you mentioned. There is one other > > important advantage: UTF-8 is stored in a way that permits

Re: Nicest UTF

2004-12-06 Thread Arcane Jill
or something? Arcane Jill -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of Marcin 'Qrczak' Kowalczyk Sent: 02 December 2004 16:59 To: [EMAIL PROTECTED] Subject: Re: Nicest UTF "Arcane Jill" <[EMAIL PROTECTED]> writes: Oh for a chip

Re: Nicest UTF

2004-12-05 Thread Doug Ewell
Philippe Verdy wrote: > Only the encoder may be a bit complex to write (if one wants to > generate the optimal smallest result size), but even a moderate > programmer could find a simple and working scheme with a still > excellent compression rate (around 1 to 1.2 bytes per character on > average

Re: Nicest UTF

2004-12-05 Thread Doug Ewell
Philippe Verdy wrote: >> Here is a string, expressed as a sequence of bytes in SCSU: >> >> 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E >> M o s s o v SP i s SP . > > Without looking at it, it's easy to see that this stream is separated

SCSU as internal encoding (was: Re: Nicest UTF)

2004-12-05 Thread Doug Ewell
Philippe Verdy wrote: >> The point is that indexing should better be O(1). > > SCSU is also O(1) in terms of indexing complexity... simply because it > keeps the exact equivalence with codepoints, and requires a *fixed* > (and small) number of steps to decode it to code points, but also > because

Fw: Nicest UTF

2004-12-05 Thread Philippe Verdy
From: "Doug Ewell" <[EMAIL PROTECTED]> Here is a string, expressed as a sequence of bytes in SCSU: 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E See how long it takes you to decode this to Unicode code points. (Do not refer to UTN #14; that would be cheating. :-) Without lookin

Re: Nicest UTF

2004-12-05 Thread Philippe Verdy
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> Now consider scanning forwards. We want to strip a beginning of a string. For example the string is an irc message prefixed with a command and we want to take the message only for further processing. We have found the end of the prefix and we wa

Re: Nicest UTF

2004-12-05 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > The question is why you would need to extract the nth codepoint so > blindly. For example I'm scanning a string backwards (to remove '\n' at the end, to find and display the last N lines of a buffer, to find the last '/' or last '.' in a file name).
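
For comparison, the backward scans listed in the message need no decoding at all when the buffer is UTF-8, because bytes below 0x80 never occur inside a multi-byte sequence. The Python sketch below swaps in UTF-8 for the stateful encoding under discussion to show that contrast; the function names are illustrative.

```python
def last_component(path: bytes) -> bytes:
    """Return the part after the last '/' in a UTF-8 encoded path."""
    cut = path.rfind(b"/")   # byte-level search is safe for ASCII delimiters
    return path[cut + 1:]    # rfind gives -1 when absent, so the whole path


def strip_trailing_newline(line: bytes) -> bytes:
    """Drop one trailing line feed, if present."""
    return line[:-1] if line.endswith(b"\n") else line


if __name__ == "__main__":
    path = "/tmp/\u017e\u00e1ba/soubor.txt".encode("utf-8")
    print(last_component(path), strip_trailing_newline(b"abc\n"))
```

A stateful compression scheme gives no such guarantee, since the meaning of a byte depends on what came before it.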

Re: Nicest UTF

2004-12-05 Thread Philippe Verdy
From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> "Philippe Verdy" <[EMAIL PROTECTED]> writes: The point is that indexing should better be O(1). SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points. Th

Re: Nicest UTF

2004-12-05 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: >> The point is that indexing should better be O(1). > > SCSU is also O(1) in terms of indexing complexity... It is not. You can't extract the nth code point without scanning the previous n-1 code points. > But individual characters do not always have
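
The scanning argument applies to any variable-width encoding. As a hedged illustration, the Python sketch below uses UTF-8 rather than SCSU (a deliberate substitution, since SCSU decoding is considerably longer) and finds the byte offset of the n-th code point; the function name `utf8_offset_of` is invented here.

```python
def utf8_offset_of(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based) in well-formed UTF-8."""
    seen = 0
    for i, b in enumerate(data):
        if (b & 0xC0) != 0x80:   # not a continuation byte: a code point starts here
            if seen == n:
                return i
            seen += 1
    raise IndexError(n)


if __name__ == "__main__":
    data = "a\u00e9\u20acb".encode("utf-8")   # 1 + 2 + 3 + 1 bytes
    print([utf8_offset_of(data, k) for k in range(4)])
```

The loop has to visit every byte before the target, so the cost grows with n; this is the sense in which such indexing is O(n) rather than O(1).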

Re: Nicest UTF

2004-12-05 Thread Philippe Verdy
- Original Message - From: "Marcin 'Qrczak' Kowalczyk" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Sunday, December 05, 2004 1:37 AM Subject: Re: Nicest UTF "Philippe Verdy" <[EMAIL PROTECTED]> writes: There's nothing that

Re: Nicest UTF

2004-12-05 Thread Doug Ewell
Philippe Verdy wrote: >> I appreciate Philippe's support of SCSU, but I don't think *even I* >> would recommend it as an internal storage format. The effort to >> encode and decode it, while by no means Herculean as often perceived, >> is not trivial once you step outside Latin-1. > > I said: "f

Re: Nicest UTF

2004-12-05 Thread Doug Ewell
Lars Kristan wrote: >> I think UTF8 would be the nicest UTF. > > I agree. But not for reasons you mentioned. There is one other > important advantage: UTF-8 is stored in a way that permits storing > invalid sequences. I will need to elaborate that, of course. I

Re: Nicest UTF

2004-12-05 Thread Doug Ewell
Asmus Freytag wrote: > Given this little model and some additional assumptions about your > own project(s), you should be able to determine the 'nicest' UTF for > your own performance-critical case. This is absolutely correct. Each situation may have different needs and con

Re: Nicest UTF

2004-12-04 Thread Marcin 'Qrczak' Kowalczyk
"Philippe Verdy" <[EMAIL PROTECTED]> writes: > There's nothing that requires the string storage to use the same > "exposed" array, The point is that indexing should better be O(1). Not having a constant size per code point requires one of three things: 1. Using opaque iterators instead of intege
