Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Lars Kristan wrote: > I never said it doesn't violate any existing rules. Stating that it > does, doesn't help a bit. Rules can be changed. Assuming we understand > the consequences. And that is what we should be discussing. By stating > what shoul

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Kenneth Whistler wrote: > I do not think this is a proposal to amend UTF-8 to allow > invalid sequences. So we should get that off the table. I hope you are right. > Apparently Lars is currently using PUA U+E080..U+E0FF > (or U+EE80..U+EEFF ?) for this purpose, enabling the round-tripping > of
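[Editor's sketch, not part of the original message.] The round-tripping scheme mentioned above can be illustrated roughly as follows. This is only a sketch under stated assumptions: it assumes the mapping is "invalid byte 0xNN becomes U+E0NN", which is one reading of the U+E080..U+E0FF range quoted above (the thread itself is unsure of the exact range), and the names PUA_BASE, decode_filename and encode_filename are hypothetical. It uses Python's standard codecs.register_error hook.

```python
import codecs

# Assumption: an invalid byte 0xNN (0x80..0xFF) is mapped to U+E0NN.
PUA_BASE = 0xE000


def _pua_escape(exc):
    """Error handler: turn undecodable bytes into private-use code points."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    if any(b < 0x80 for b in bad):
        # ASCII bytes are always valid UTF-8; do not escape them.
        raise exc
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua_escape", _pua_escape)


def decode_filename(raw: bytes) -> str:
    """Decode as UTF-8, escaping invalid bytes into U+E080..U+E0FF."""
    return raw.decode("utf-8", errors="pua_escape")


def encode_filename(text: str) -> bytes:
    """Inverse of decode_filename: escaped code points become raw bytes again."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if 0xE080 <= cp <= 0xE0FF:
            out.append(cp - PUA_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)
```

Note the limitation of any such sketch: a name that already contains a genuine U+E080..U+E0FF character in valid UTF-8 will not survive the round trip, because encode_filename cannot tell it apart from an escaped byte. Python's built-in "surrogateescape" handler takes a similar approach but escapes into U+DC80..U+DCFF instead of the PUA.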

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Doug Ewell
Philippe Verdy wrote: > An alternative can then be a mixed encoding selection: > - choose a legacy encoding that will most often be able to represent > valid filenames without loss of information (for example ISO-8859-1, > or Cp1252). > - encode the filename with it. > - try to decode it with a *

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread John Cowan
Kenneth Whistler scripsit: > Storage of UNIX filenames on Windows databases, for example, > can be done with BINARY fields, which correctly capture the > identity of them as what they are: an unconvertible array of > byte values, not a convertible string in some particular > code page. This solut

Re: Word dividers, was: proposals I wrote (and also, didn't write)

2004-12-07 Thread John Cowan
Peter Kirk scripsit: > I notice that Elaine is here proposing a HEBREW SAMARITAN PUNCTUATION > WORD DIVIDER - and this should be in the BMP as Samaritan is a script in > modern list. But there is already in the pipeline a PHOENICIAN WORD > SEPARATOR, provisionally U+1091F, and already defined U

Re: OpenType not for Open Communication?

2004-12-07 Thread John Hudson
John Cowan wrote: OpenType is a trademark of Microsoft and a proprietary font format jointly developed by Microsoft and Adobe. The question is, is it an open standard? That is, is anyone free to create OpenType fonts, OpenType font tools, OpenType font renderers? Is the documentation freely ava

Re: OpenType not for Open Communication?

2004-12-07 Thread John Cowan
John Hudson scripsit: > OpenType is a trademark of Microsoft and a proprietary font format > jointly developed by Microsoft and Adobe. The question is, is it an open standard? That is, is anyone free to create OpenType fonts, OpenType font tools, OpenType font renderers? Is the documentation f

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Kenneth Whistler
Lars, I'm going to step in here, because this argument seems to be generating more heat than light. > I never said it doesn't violate any existing rules. Stating that it does, > doesn't help a bit. Rules can be changed. > I ask you to step back and try to see the big picture. First, I'm going

Word dividers, was: proposals I wrote (and also, didn't write)

2004-12-07 Thread Peter Kirk
On 06/12/2004 22:41, E. Keown wrote: ... 1. Proposal to add Samaritan Pointing to the UCS http://www.lashonkodesh.org/samarpro.pdf WG2 number: N2748 I notice that Elaine is here proposing a HEBREW SAMARITAN PUNCTUATION WORD DIVIDER - and this should be in the BMP as Samaritan is a script in m

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Kenneth Whistler
Philippe continued: > As if Unicode had to be bound on > architectural constraints such as the requirement of representing code units > (which are architectural for a system) only as 16-bit or 32-bit units, Yes, it does. By definition. In the standard. > ignoring the fact that technologies do

Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Philippe Verdy
I know what you mean here: most Linux/Unix filesystems (as well as many legacy filesystems for Windows and MacOS...) do not track the encoding with which filenames were encoded and, depending on local user preferences when that user created that fi

If only MS Word was coded this well (was Re: Nicest UTF)

2004-12-07 Thread Theodore H. Smith
From: "D. Starner" <[EMAIL PROTECTED]> (Sorry for sending this twice, Marcin.) "Marcin 'Qrczak' Kowalczyk" writes: UTF-8 is poorly suitable for internal processing of strings in a modern programming language (i.e. one which doesn't already have a pile of legacy functions working of bytes, but whic

Re: [hebrew] Re: proposals I wrote (and also, didn't write)

2004-12-07 Thread Asmus Freytag
At 09:50 PM 12/6/2004, John Hudson wrote: I don't know. I try to avoid politics, if possible. The significance of what I'm saying is that you have made a good start in your proposal, that it has some shortcomings, and that I hope to be able to help put something more complete together. It wou

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Philippe Verdy
From: "Kenneth Whistler" <[EMAIL PROTECTED]> Yes, and pigs could fly, if they had big enough wings. Once again, this is a creative comment. As if Unicode had to be bound on architectural constraints such as the requirement of representing code units (which are architectural for a system) only as

RE: No Invisible Character - NBSP at the start of a word

2004-12-07 Thread Asmus Freytag
At 11:52 PM 12/6/2004, Jony Rosenne wrote: In chapter 8, regarding Hebrew, the standard says: Positioning. Marks may combine with vowels and other points, and there are complex typographic rules for positioning these combinations. I understand that this sentence should be regarded as being normativ

RE: proposals I wrote (and also, didn't write)

2004-12-07 Thread Peter Constable
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf > Of E. Keown > In the so-called 'deprecated' block, the 2nd Hebrew > block in the BMP, are composed Hebrew points which I > plan to go on using. And I expect everyone else to go > on using them also, all Hebraists. We think they are

Re: proposals I wrote (and also, didn't write)

2004-12-07 Thread John Hudson
E. Keown wrote: In the so-called 'deprecated' block, the 2nd Hebrew block in the BMP, are composed Hebrew points which I plan to go on using. And I expect everyone else to go on using them also, all Hebraists. We think they are needed for 'text representation' of shin and sin. It really is a be

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Rick McGowan
> Yes, and pigs could fly, if they had big enough wings. An 8-foot wingspan should do it. For picture of said flying pig see: http://www.cincinnati.com/bigpiggig/profile_091700.html http://www.cincinnati.com/bigpiggig/images/pig091700.jpg Rick

RE: OpenType vs TrueType (was current version of unicode-font)

2004-12-07 Thread Gary P Grosso
Thanks to Peter Constable, John Hudson, Tom Gewecke, Christopher Fynn, and others, for taking the time to address my question. Gary --- Gary Grosso Arbortext, Inc. Ann Arbor, MI, USA

Re: Nicest UTF.. UTF-9, UTF-36, UTF-80, UTF-64, ...

2004-12-07 Thread Kenneth Whistler
Philippe stated, and I need to correct: > UTF-24 already exists as an encoding form (it is identical to UTF-32), if > you just consider that encoding forms just need to be able to represent a > valid code range within a single code unit. This is false. Unicode encoding forms exist by virtue of

Re: Unicode for words?

2004-12-07 Thread Doug Ewell
Richard Cook wrote: > Well, why stop with words, my lord? Why not just encode all sentences, > paragraphs, pages, chapters, books, libraries, or your higher level > unit of choice, for that matter. > ... > Whether you choose to associate a single glyph with your private-use > code point, or an en

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Lars Kristan
Doug Ewell replied: > Actually the Unicode Technical Committee. But you are > correct: it is up > to the UTC to decide whether they want to redefine UTF-8 to permit > invalid sequences, which are to be interpreted as unknown characte

RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

2004-12-07 Thread Lars Kristan
Doug Ewell wrote: > John Cowan wrote: > > > Windows filesystems do know what encoding they use. But a > filename on > > a Unix(oid) file system is a mere sequence of octets, of > which only 00 > > and 2F are interpreted. (Filenam

Re: No Invisible Character - NBSP at the start of a word

2004-12-07 Thread Peter Kirk
On 07/12/2004 07:52, Jony Rosenne wrote: ... Consequently, there is not and cannot be anything wrong with Unicode (at least in this respect) and it does support "ANY sequence of Hebrew vowels and consonants". I do maintain that in some cases the typographic process would require out of band assistance

Re: proposals I wrote (and also, didn't write)

2004-12-07 Thread E. Keown
Elaine in Vancouver Dear Mark: Thanks, I guess. > This is the one I'm going to comment on, since it's > the one I know best. > I know that Michael Everson and I are working on a > Samaritan proposal, It appears to me that my proposal came first, no? By some months...I have some mate

Re: proposals I wrote (and also, didn't write)

2004-12-07 Thread E. Keown
Elaine Keown Vancouver Dear Philippe and Lists: > In all your searches and in your proposals, did you > try to segregate the proposed additional characters > into two separate categories: those needed > for inclusion within many modern studies, and those The Samaritan marks are sti

Re: Nicest UTF

2004-12-07 Thread Philippe Verdy
From: "D. Starner" <[EMAIL PROTECTED]> If you're talking about a language that hides the structure of strings and has no problem with variable length data, then it wouldn't matter what the internal processing of the string looks like. You'd need to use iterators and discourage the use of arbitrary

Re: Unicode for words?

2004-12-07 Thread Richard Cook
On Dec 5, 2004, at 07:02 PM, Doug Ewell wrote: A word-based encoding for English could automatically assume spaces where they are appropriate. The sentence: "What means this, my lord?" would have seven encodable elements: the five words, the comma, and the question mark. Spaces would be automatic
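[Editor's sketch, not part of the original message.] The count of seven encodable elements in the quoted example can be checked with a small tokenizer. This is a minimal illustration only; the function name encodable_elements and the rule "a run of word characters, or a single punctuation mark" are assumptions made here, not something the thread specifies.

```python
import re


def encodable_elements(sentence: str) -> list:
    """Split a sentence into words and punctuation marks, dropping spaces."""
    # \w+ matches a whole word; [^\w\s] matches one punctuation character.
    return re.findall(r"\w+|[^\w\s]", sentence)


elements = encodable_elements("What means this, my lord?")
print(len(elements), elements)
```

Spaces never appear as elements in this sketch, which matches the idea that a word-based encoding would leave them implicit.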

RE: No Invisible Character - NBSP at the start of a word

2004-12-07 Thread Jony Rosenne
In chapter 8, regarding Hebrew, the standard says: Positioning. Marks may combine with vowels and other points, and there are complex typographic rules for positioning these combinations. I understand that this sentence should be regarded as being normative. Clause 4.3 uses the word "tend". Ch