Re: Cursor movement in Hebrew, was: Non-ascii string processing?

2003-10-09 Thread Ted Hopp
One issue with deleting a DGC non-atomically is that deleting only the base character can lead to all sorts of strange and problematic combining character sequences. At a minimum, deleting a base character should delete the entire DGC atomically. In Hebrew, I don't see any problem with deleting com
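To illustrate the atomic deletion described above, here is a minimal Python sketch. It assumes a cluster is approximated as one base character plus any trailing characters in a Unicode Mark category (Mn, Mc, Me). That is a simplification of the default grapheme cluster rules, not a full implementation, and the function name is hypothetical.

```python
import unicodedata


def delete_last_cluster(text: str) -> str:
    """Backspace that removes a base character together with its marks.

    Assumption: a "cluster" here is a base character followed by
    characters whose general category starts with "M". Real default
    grapheme cluster rules cover more cases than this.
    """
    end = len(text)
    # Skip back over trailing combining marks.
    while end > 0 and unicodedata.category(text[end - 1]).startswith("M"):
        end -= 1
    # Then drop the base character the marks were attached to.
    if end > 0:
        end -= 1
    return text[:end]


# ALEF, then BET with DAGESH and PATAH (one base plus two marks).
sample = "\u05d0\u05d1\u05bc\u05b7"
after_backspace = delete_last_cluster(sample)
```

Deleting only U+05D1 from `sample` would instead leave the dagesh and patah attached to the preceding alef, which is the kind of stray combining sequence the message warns about.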

Re: Cursor movement in Hebrew, was: Non-ascii string processing?

2003-10-09 Thread Mark E. Shoulson
Peter Kirk wrote: On 08/10/2003 21:55, Jungshik Shin wrote: ... I've got a question about the cursor movement and selection in Hebrew text with such a grapheme (made up of 6 Unicode characters). What would be ordinary users' expectation when delete, backspace, and arrow keys (for cursor movement

Cursor movement in Hebrew, was: Non-ascii string processing?

2003-10-09 Thread Peter Kirk
On 08/10/2003 21:55, Jungshik Shin wrote: ... I've got a question about the cursor movement and selection in Hebrew text with such a grapheme (made up of 6 Unicode characters). What would be ordinary users' expectation when delete, backspace, and arrow keys (for cursor movement) are pressed arou

Re: Non-ascii string processing?

2003-10-08 Thread Jungshik Shin
On Tue, 7 Oct 2003, Peter Kirk wrote: > On 07/10/2003 04:35, Jill Ramonsky wrote: > Anyway, DGCs are not always what you want to work with. Besides, DGCs are just for the default and are not the absolute invariant atomic unit that can never be broken. In some situations, delete operation and c

Re: Non-ascii string processing?

2003-10-08 Thread Jungshik Shin
On Tue, 7 Oct 2003, Peter Kirk wrote: > On 07/10/2003 05:29, Marco Cimarosti wrote: > I could imagine a dialect of Basic which had separate string handling > functions for UTF-8 bytes and for characters. This is how the In Perl 5.8 or later that uses UTF-8, unless otherwise explicitly specifie

Re: Non-ascii string processing?

2003-10-08 Thread John Cowan
Doug Ewell scripsit: > But the "undefined characters" issue is a greater problem. Limiting the > pool of valid name characters to those already assigned in Unicode X.X > would mean either: > > (a) the XML spec would have to be updated promptly, 1 to 2 times per > year, to keep up with each new m

Re: Non-ascii string processing?

2003-10-08 Thread jon
> Of course it would have been possible to handle the "Astral Planes" > > uniformly by making every character in them a legal Char, but not a > valid name character or name start character. This would have avoided > silliness like elements named after the musical symbol for a six > string fretb

Re: Non-ascii string processing?

2003-10-08 Thread Doug Ewell
Elliotte Rusty Harold wrote: > Of course it would have been possible to handle the "Astral Planes" > uniformly by making every character in them a legal Char, but not a > valid name character or name start character. This would have avoided > silliness like elements named after the musical symbol

Re: Non-ascii string processing?

2003-10-08 Thread Elliotte Rusty Harold
At 8:08 AM -0400 10/8/03, John Cowan wrote: No, it doesn't. There was a strong feeling in the W3C Core WG that it be possible to handle the Astral Planes uniformly; every character off the BMP, therefore, is a valid Char as well as a valid NameStartChar. Of course it would have been possible to

Re: Non-ascii string processing?

2003-10-08 Thread jon
John Cowan <[EMAIL PROTECTED]> wrote : > [EMAIL PROTECTED] > scripsit: > > > (The XML1.1 spec removes a few of those characters, I would have > > removed more, but that's another issue). > > You have no idea what fearful drubbings I had to administer to get > even the few removed that I did. We

Re: Non-ascii string processing?

2003-10-08 Thread John Cowan
[EMAIL PROTECTED] scripsit: > (The XML1.1 spec removes a few of those characters, I would have > removed more, but that's another issue). You have no idea what fearful drubbings I had to administer to get even the few removed that I did. > [D]oes ISO 10646 allow those characters even though Unic

RE: Non-ascii string processing?

2003-10-08 Thread jon
> > A W3C XML Schema Language validator needs a character based API to > > correctly implement the minLength and maxLength facets on xsd:string > > As far as I understand, xsd:string is a list of "Character"-s, and > a > "Character" is an integer which can hold any valid Unicode code > point. N

Re: Non-ascii string processing? - count display units

2003-10-07 Thread Markus Scherer
You might want to look at East Asian Width http://unicode.org/reports/tr11/ for an approximation of the green-screen width of a string. To be absolutely precise, you need feedback from your green-screen layout engine and its font, of course, like you do for a graphical display. markus Edward H
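A rough sketch of the approximation suggested above, using Python's standard `unicodedata.east_asian_width`. The mapping is an assumption made for illustration: Wide and Fullwidth characters count as two columns, non-spacing and enclosing marks as zero, and everything else as one. As the message says, a real terminal and its font have the final word.

```python
import unicodedata


def approx_columns(text: str) -> int:
    """Estimate fixed-width display columns from East Asian Width."""
    columns = 0
    for ch in text:
        # Non-spacing and enclosing marks are drawn on the previous cell.
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue
        # "W" (Wide) and "F" (Fullwidth) usually occupy two cells.
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            columns += 2
        else:
            columns += 1
    return columns


ascii_cols = approx_columns("abc")
cjk_cols = approx_columns("\u65e5\u672c\u8a9e")   # three CJK ideographs
accent_cols = approx_columns("e\u0301")           # e + combining acute
```

Ambiguous-width characters are treated as narrow here; a terminal running in an East Asian locale may render them wide.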

RE: Non-ascii string processing?

2003-10-07 Thread Francois Yergeau
Marco Cimarosti wrote: > As far as I understand, xsd:string is a list of "Character"-s, and a > "Character" is an integer which can hold any valid Unicode code point. Not quite. XML Schema points to XML for its definition of character, and XML in turn says "A character is an atomic unit of text a

Re: What things are called (was Non-ascii string processing)

2003-10-07 Thread Peter Kirk
On 07/10/2003 08:42, Doug Ewell wrote: ... The Book of Genesis would be an awfully thin "book" if it appeared on the shelf individually. ... Not that thin, actually - 85 pages in my Hebrew Bible. But some of the "books", e.g. Obadiah and 2 and 3 John, fit easily on one page. So your point sta

RE: Non-ascii string processing?

2003-10-07 Thread Marco Cimarosti
Elliotte Rusty Harold wrote: > A W3C XML Schema Language validator needs a character based API to > correctly implement the minLength and maxLength facets on xsd:string As far as I understand, xsd:string is a list of "Character"-s, and a "Character" is an integer which can hold any valid Unicode

RE: Non-ascii string processing?

2003-10-07 Thread jon
> A W3C XML Schema Language validator needs a character based API to > correctly implement the minLength and maxLength facets on xsd:string > and types derived from it. Perhaps you would argue that the schema > language should itself be written in terms of grapheme clusters > rather than charac

Re: What things are called (was Non-ascii string processing)

2003-10-07 Thread Doug Ewell
Jill Ramonsky wrote... Well, one thing she wrote was: > :-) OK, that's out of the way. What follows is not necessarily 100% serious. > I have invented a new system, Unilib, for organising books in a > library. > > ... Except that you're not allowed to call them "books" any more, > because I've

Re: Non-ascii string processing?

2003-10-07 Thread John Delacour
At 4:20 am -0700 7/10/03, Peter Kirk wrote: Suppose I have a UTF-8 string and want to know how many default grapheme clusters it contains. How do I do so? Well, I step through the string character by character, combining successive characters into grapheme clusters. To do this without having

RE: Non-ascii string processing?

2003-10-07 Thread Elliotte Rusty Harold
At 12:35 PM +0100 10/7/03, Jill Ramonsky wrote: I have yet to see an APPLICATION which needs a character-based API. Jill A W3C XML Schema Language validator needs a character based API to correctly implement the minLength and maxLength facets on xsd:string and types derived from it. Perhaps you

Re: Non-ascii string processing?

2003-10-07 Thread Peter Kirk
On 07/10/2003 05:29, Marco Cimarosti wrote: Peter Kirk wrote: For i% = 1 to Len(utf8string$) c$ = Mid(utf8string$, i%, 1) Process c$ Next i% Such a loop would be more efficient in UTF-32 of course, but this is still a real need for working with character counts. If the string type a
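The Basic loop quoted above walks a UTF-8 string one character at a time. A hedged Python sketch of the same idea over raw bytes follows; the generator name is hypothetical, and it relies on the fact that the lead byte of a UTF-8 sequence determines the sequence length.

```python
from typing import Iterator


def iter_utf8_chars(data: bytes) -> Iterator[str]:
    """Yield one character at a time from UTF-8 encoded bytes."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:               # 0xxxxxxx: single byte (ASCII)
            length = 1
        elif lead >> 5 == 0b110:      # 110xxxxx: two-byte sequence
            length = 2
        elif lead >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
            length = 3
        elif lead >> 3 == 0b11110:    # 11110xxx: four-byte sequence
            length = 4
        else:
            raise ValueError(f"unexpected byte 0x{lead:02x} at offset {i}")
        # decode() also rejects truncated or malformed sequences.
        yield data[i:i + length].decode("utf-8")
        i += length


sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")   # 1, 2, 3 and 4 byte chars
chars = list(iter_utf8_chars(sample))
```

Each step costs time proportional to the sequence length, so a full pass stays linear in the byte count even though the encoding is variable width.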

RE: Non-ascii string processing?

2003-10-07 Thread jon
> No. What you have demonstrated below is that given an API > based on characters, one can write an API based on > default grapheme clusters. Nonetheless, it is only the > resulting default-grapheme-cluster-based API which would > actually be of any use to end-users. How close to the "end" do

RE: What things are called (was Non-ascii string processing)

2003-10-07 Thread Jill Ramonsky
> Jill Ramonsky wrote: > > Hey - the public will just have to get used to it! > > No, the public should not be bored with these technical > details: in the user > manual, a "book" will still be a "book

RE: Non-ascii string processing?

2003-10-07 Thread Marco Cimarosti
Peter Kirk wrote: > For i% = 1 to Len(utf8string$) > c$ = Mid(utf8string$, i%, 1) > Process c$ > Next i% > > Such a loop would be more efficient in UTF-32 of course, but this is > still a real need for working with character counts. If the string type and function of this Basic dialect i

RE: What things are called (was Non-ascii string processing)

2003-10-07 Thread Marco Cimarosti
Jill Ramonsky wrote: > Hey - the public will just have to get used to it! No, the public should not be bored with these technical details: in the user manual, a "book" will still be a "book". The fact that, in the source code of the application "book" means something else is of interest only to pr

Re: Non-ascii string processing?

2003-10-07 Thread Peter Kirk
On 07/10/2003 04:35, Jill Ramonsky wrote: No. What you have demonstrated below is that given an API based on characters, one can write an API based on default grapheme clusters. Nonetheless, it is only the /resulting/ default-grapheme-cluster-based API which would actually be of any use to e

Re: What things are called (was Non-ascii string processing)

2003-10-07 Thread jon
> (2) The object currently called a "character" be renamed as something > > like "mapped codepoint" or "encoded codepoint", or possibly > (coming in > from the other end) something like "sub-character" or "character > > component" or "characterette" (which can be shortened to > "charette" and >

RE: Non-ascii string processing?

2003-10-07 Thread Jill Ramonsky
> On 07/10/2003 02:35, Jill Ramonsky wrote: > > Knowing the number of characters won't help you one iota

Re: Non-ascii string processing?

2003-10-07 Thread Peter Kirk
On 07/10/2003 02:35, Jill Ramonsky wrote: Knowing the number of characters won't help you one iota. What you need to know here is the number of default grapheme clusters. I still have yet to hear a useful purpose for counting the number of /characters/. Jill Suppose I have a UTF-8 string and w
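The stepping approach described above can be sketched in Python as follows. The rule used here is an assumption made for brevity: a new cluster starts at every character that is not in a Unicode Mark category. The actual default grapheme cluster definition has further rules (Hangul jamo, for example), so treat the result as an approximation. The function name is hypothetical.

```python
import unicodedata


def approx_cluster_count(utf8: bytes) -> int:
    """Count approximate grapheme clusters in a UTF-8 string.

    The clusters are built on top of characters: the bytes are decoded
    to code points first, and marks are then folded into the
    preceding base character.
    """
    text = utf8.decode("utf-8")
    count = 0
    for index, ch in enumerate(text):
        is_mark = unicodedata.category(ch).startswith("M")
        # A mark with no preceding base still occupies a cluster.
        if not is_mark or index == 0:
            count += 1
    return count


# SHIN + QAMATS + SHIN DOT, LAMED, VAV + HOLAM, FINAL MEM
hebrew = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
code_points = len(hebrew)
clusters = approx_cluster_count(hebrew.encode("utf-8"))
```

The sample has seven code points but only four clusters, which is the gap between the two counts the thread is arguing about.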

RE: What things are called (was Non-ascii string processing)

2003-10-07 Thread Jill Ramonsky
I have invented a new system, Unilib, for organising books in a library. ... Except that you're not allowed to call them "books" any more, because I've already redefined the word "book" to mean "the physical expression of a catalogue entry". Since what the user normally experiences as a book ma

What things are called (was Non-ascii string processing)

2003-10-07 Thread Jill Ramonsky
Sigh! Things were a lot easier back in the old days of Unicode version 3, when default grapheme clusters were still called "glyphs". Okay, so the general public still got it wrong, but that was just because they were ignorant monkeys who didn't know any better, and it was up to the likes of us

RE: Non-ascii string processing?

2003-10-07 Thread jon
> Now - a count of DEFAULT GRAPHEME CLUSTERs might be useful (for example, > for display on a console which uses fixed-width fonts). Indeed, a whole > class of DEFAULT GRAPHEME CLUSTER handling functions might come in very > handy indeed. Bytes are useful. Default grapheme clusters are useful.

RE: Non-ascii string processing?

2003-10-07 Thread Jill Ramonsky
> Well, I know a good use for it: a console or terminal-based > application which > displays information using fixed-widt

RE: Non-ascii string processing?

2003-10-07 Thread Jill Ramonsky
> Tell that to the editor (editors of paper publications still talk with > this unit "3 000 characters, no more, for tomorrow morning").

Bogus UTF's are back! :-) (was RE: Non-ascii string processing?)

2003-10-07 Thread Marco Cimarosti
Doug Ewell wrote: > [...] > > we'd all use UTF-336. Er? > > If only I had a bit more spare time, Jill. You do NOT want to get me > started... >:-) Go for it, Doug! :-) If I only had a bit of spare time myself, I'd be eager of running bits-per-character statistics for UTF:-)336 in various l

Re: Non-ascii string processing?

2003-10-06 Thread Doug Ewell
Jill Ramonsky wrote: > But then, a default grapheme cluster might theoretically require up to > 16 Unicode characters. (Maybe more, I don't know). Even bit-packed to > 21 bits per character, that still gives us 336 bits. So I conclude > that our string processing functions could go a lot faster i

Re: Non-ascii string processing?

2003-10-06 Thread Edward H. Trager
On Monday 2003.10.06 21:36:13 +0200, Marco Cimarosti wrote: > Edward H. Trager wrote: > > > But I still don't see any use in knowing how many > > characters are in an UTF-8 > > > string, apart the use that I already mentioned: allocating > > a buffer for a > > > UTF-8 to UTF-32 conversion. > > >

RE: Non-ascii string processing?

2003-10-06 Thread Marco Cimarosti
Edward H. Trager wrote: > > But I still don't see any use in knowing how many > characters are in an UTF-8 > > string, apart the use that I already mentioned: allocating > a buffer for a > > UTF-8 to UTF-32 conversion. > > Well, I know a good use for it: a console or terminal-based > applicatio

RE: Non-ascii string processing?

2003-10-06 Thread Jill Ramonsky
Could you try that again with codepoints > U+ please? I'd be curious to know what happens. Jill

RE: Non-ascii string processing?

2003-10-06 Thread Jill Ramonsky
> What strlen() cannot do is counting the number of *characters* in a string. > But who cares? I can imagine very few situations where such information would be useful. > > _ Marco

Re: Non-ascii string processing?

2003-10-06 Thread Edward H. Trager
On Monday 2003.10.06 17:15:25 +0200, Marco Cimarosti wrote: > Stephane Bortzmeyer wrote: > > > OK. But the length in "characters" of a string is not > > "character semantics": > > > it's plain nonsense, IMHO. > > > > I disagree. > > Feel free. > > But I still don't see any use in knowing how ma

RE: Non-ascii string processing?

2003-10-06 Thread jon
> But I still don't see any use in knowing how many characters are in an UTF-8 > string, apart the use that I already mentioned: allocating a buffer for a > UTF-8 to UTF-32 conversion. I wouldn't use it for that at all. I'd assume a worst case of one 32-bit word in the UTF-32 per octet in the UTF-8 o
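A small Python sketch comparing the two sizing strategies under discussion. The function names are hypothetical. The worst-case bound holds because every character takes at least one octet in UTF-8, so a string can never contain more characters than octets.

```python
def utf32_worst_case_bytes(utf8: bytes) -> int:
    """Upper bound: one 32-bit word per UTF-8 octet, no counting pass."""
    return 4 * len(utf8)


def utf32_exact_bytes(utf8: bytes) -> int:
    """Exact size: needs a full pass to count the characters first."""
    return 4 * len(utf8.decode("utf-8"))


sample_text = "a\u00e9\u20ac\U0001F600"      # 1, 2, 3 and 4 byte characters
sample = sample_text.encode("utf-8")

worst = utf32_worst_case_bytes(sample)
exact = utf32_exact_bytes(sample)
# "utf-32-le" is used so that no byte order mark is added to the output.
actual = len(sample_text.encode("utf-32-le"))
```

The trade-off is one extra pass over the input against over-allocating by up to a factor of four for text that is mostly outside ASCII.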

RE: Non-ascii string processing?

2003-10-06 Thread Marco Cimarosti
Stephane Bortzmeyer wrote: > > OK. But the length in "characters" of a string is not > "character semantics": > > it's plain nonsense, IMHO. > > I disagree. Feel free. But I still don't see any use in knowing how many characters are in an UTF-8 string, apart the use that I already mentioned: al

Re: Non-ascii string processing?

2003-10-06 Thread jon
> > a word like "élite" is always counted as five characters, > regardless > > that it might be encoded as six Unicode "characters". > > I assume that everybody on this list knows that you count characters > only after a proper normalization... (like many operations on Unicode > texts). A word li

RE: Non-ascii string processing?

2003-10-06 Thread John Delacour
At 12:09 pm +0200 6/10/03, Marco Cimarosti wrote: What strlen() cannot do is counting the number of *characters* in a string. But who cares? I can imagine very few situations where such information would be useful. #!/usr/bin/perl print "ab, \x{}\x{aaab}" ; printf "\n%s, %s", le

Re: Non-ascii string processing?

2003-10-06 Thread 'Stephane Bortzmeyer'
On Mon, Oct 06, 2003 at 01:52:26PM +0200, Marco Cimarosti <[EMAIL PROTECTED]> wrote a message of 51 lines which said: > a word like "élite" is always counted as five characters, regardless > that it might be encoded as six Unicode "characters". I assume that everybody on this list knows that y
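The "élite" example can be checked with Python's standard `unicodedata.normalize`. This is only an illustration of how the character count depends on the normalization form; the variable names are made up.

```python
import unicodedata

precomposed = "\u00e9lite"     # é as a single code point, U+00E9
decomposed = "e\u0301lite"     # e followed by U+0301 COMBINING ACUTE ACCENT

# Raw code point counts differ between the two spellings.
raw_lengths = (len(precomposed), len(decomposed))

# After normalization both spellings give the same count.
nfc_lengths = tuple(
    len(unicodedata.normalize("NFC", s)) for s in (precomposed, decomposed)
)
nfd_lengths = tuple(
    len(unicodedata.normalize("NFD", s)) for s in (precomposed, decomposed)
)

print(raw_lengths, nfc_lengths, nfd_lengths)
```

So the count is stable once a form is fixed, but NFC and NFD still disagree with each other (five versus six), and only NFC matches the five a reader would count.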

Re: Non-ascii string processing?

2003-10-06 Thread jon
> > If you really aren't processing anything but the ASCII characters > > within > > your strings, like "<" and ">" in your example, > you can probably get > > away with keeping your existing byte-oriented code. At least you won't > > get false matches on the ASCII characters (this was a primary

Re: Non-ascii string processing?

2003-10-06 Thread Peter Kirk
On 06/10/2003 03:09, Marco Cimarosti wrote: Doug Ewell wrote: Depends on what "processing" you are talking about. Just to cite the most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented strlen() will fail dramatically. Why? The purpose of strlen() is counting the number of

RE: Non-ascii string processing?

2003-10-06 Thread Marco Cimarosti
Stephane Bortzmeyer wrote: > On Mon, Oct 06, 2003 at 12:09:34PM +0200, > Marco Cimarosti <[EMAIL PROTECTED]> wrote > a message of 14 lines which said: > > > What strlen() cannot do is counting the number of > *characters* in a string. > > But who cares? I can imagine very few situations where

Re: Non-ascii string processing?

2003-10-06 Thread Stephane Bortzmeyer
On Mon, Oct 06, 2003 at 12:09:34PM +0200, Marco Cimarosti <[EMAIL PROTECTED]> wrote a message of 14 lines which said: > What strlen() cannot do is counting the number of *characters* in a string. > But who cares? I can imagine very few situations where such > information would be use

RE: Non-ascii string processing?

2003-10-06 Thread Marco Cimarosti
Theodore H. Smith wrote: > Hi lists, Hi, member. > I'm wondering how people tend to do their non-ascii string processing. I think no one has been doing ASCII string processing for decades. :-) But I guess you meant non-SBCS ("single byte character set") string processin

RE: Non-ascii string processing?

2003-10-06 Thread Marco Cimarosti
Doug Ewell wrote: > Depends on what "processing" you are talking about. Just to cite the > most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented > strlen() will fail dramatically. Why? The purpose of strlen() is counting the number of *bytes* needed to store a certain string, and
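The distinction drawn above, bytes needed for storage versus characters, can be shown in a few lines of Python, where `len()` on `bytes` plays the role of a byte-oriented strlen() and `len()` on `str` counts code points.

```python
# "élite" with a precomposed é (U+00E9), encoded as UTF-8.
text = "\u00e9lite"
utf8 = text.encode("utf-8")

byte_count = len(utf8)         # what a byte-oriented strlen() would report
code_point_count = len(text)   # Python str length counts code points

print(byte_count, code_point_count)
```

The byte count is the right number for sizing a buffer, and the code point count is a different quantity; neither one is wrong for its own purpose.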

Re: Non-ascii string processing?

2003-10-05 Thread Doug Ewell
Theodore H. Smith wrote: >> If you really aren't processing anything but the ASCII characters >> within your strings, like "<" and ">" in your example, you can >> probably get away with keeping your existing byte-oriented code. >> At least you won't get false matches on the ASCII characters (this
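The point about false matches can be sketched in Python. In UTF-8, every byte of a multi-byte sequence has its high bit set, so a byte-oriented search for an ASCII character such as "<" can only hit a real "<". The sample string is invented for the illustration.

```python
text = "<caf\u00e9>\u65e5\u672c<b>"
data = text.encode("utf-8")

# Byte-oriented scan for "<", with no decoding at all.
positions = [i for i, byte in enumerate(data) if byte == ord("<")]

# Every byte that belongs to a non-ASCII character is 0x80 or above,
# so it can never be mistaken for an ASCII byte.
non_ascii_bytes = [
    byte for ch in text if ord(ch) > 0x7F for byte in ch.encode("utf-8")
]
all_high = all(byte >= 0x80 for byte in non_ascii_bytes)
```

The offsets found this way are byte offsets, which is fine for slicing the byte string but is not the same as a character index.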

Re: Non-ascii string processing?

2003-10-05 Thread Theodore H. Smith
Hi Doug, here are some things I think. If you really aren't processing anything but the ASCII characters within your strings, like "<" and ">" in your example, you can probably get away with keeping your existing byte-oriented code. At least you won't get false matches on the ASCII characters (thi

Re: Non-ascii string processing?

2003-10-04 Thread Doug Ewell
Theodore H. Smith wrote: > I'm wondering how people tend to do their non-ascii string processing. > > I'm wondering, if anyone really needs anything other than byte > oriented code? I'm using UTF8 as my character format, and UTF8 is > variable width, of course. I

Non-ascii string processing?

2003-10-04 Thread Theodore H. Smith
Hi lists, I'm wondering how people tend to do their non-ascii string processing. I'm wondering, if anyone really needs anything other than byte oriented code? I'm using UTF8 as my character format, and UTF8 is variable width, of course. I offer the option of processing