One issue with deleting a DGC non-atomically is that deleting only the base
character can lead to all sorts of strange and problematic combining
character sequences. At a minimum, deleting a base character should delete
the entire DGC atomically. In Hebrew, I don't see any problem with deleting
com
Peter Kirk wrote:
On 08/10/2003 21:55, Jungshik Shin wrote:
...
I've got a question about the cursor movement and
selection in Hebrew text with such a grapheme (made up of 6 Unicode
characters). What would be ordinary users' expectation when delete,
backspace, and arrow keys(for cursor movement
On 08/10/2003 21:55, Jungshik Shin wrote:
...
I've got a question about the cursor movement and
selection in Hebrew text with such a grapheme (made up of 6 Unicode
characters). What would be ordinary users' expectation when delete,
backspace, and arrow keys(for cursor movement) are pressed arou
On Tue, 7 Oct 2003, Peter Kirk wrote:
> On 07/10/2003 04:35, Jill Ramonsky wrote:
> Anyway, DGCs are not always what you want to work with.
Besides, DGCs are just for the default and are not the
absolute invariant atomic unit that can never be broken. In some
situations, delete operation and c
On Tue, 7 Oct 2003, Peter Kirk wrote:
> On 07/10/2003 05:29, Marco Cimarosti wrote:
> I could imagine a dialect of Basic which had separate string handling
> functions for UTF-8 bytes and for characters. This is how the
In Perl 5.8 or later that uses UTF-8, unless otherwise explicitly
specifie
Doug Ewell scripsit:
> But the "undefined characters" issue is a greater problem. Limiting the
> pool of valid name characters to those already assigned in Unicode X.X
> would mean either:
>
> (a) the XML spec would have to be updated promptly, 1 to 2 times per
> year, to keep up with each new m
> Of course it would have been possible to handle the "Astral Planes"
> uniformly by making every character in them a legal Char, but not a
> valid name character or name start character. This would have avoided
> silliness like elements named after the musical symbol for a six
> string fretb
Elliotte Rusty Harold wrote:
> Of course it would have been possible to handle the "Astral Planes"
> uniformly by making every character in them a legal Char, but not a
> valid name character or name start character. This would have avoided
> silliness like elements named after the musical symbol
At 8:08 AM -0400 10/8/03, John Cowan wrote:
No, it doesn't. There was a strong feeling in the W3C Core WG that
it be possible to handle the Astral Planes uniformly; every character
off the BMP, therefore, is a valid Char as well as a valid NameStartChar.
Of course it would have been possible to
John Cowan <[EMAIL PROTECTED]> wrote :
> [EMAIL PROTECTED]
> scripsit:
>
> > (The XML1.1 spec removes a few of those characters, I would have
> > removed more, but that's another issue).
>
> You have no idea what fearful drubbings I had to administer to get
> even the few removed that I did.
We
[EMAIL PROTECTED] scripsit:
> (The XML1.1 spec removes a few of those characters, I would have
> removed more, but that's another issue).
You have no idea what fearful drubbings I had to administer to get
even the few removed that I did.
> [D]oes ISO 10646 allow those characters even though Unic
> > A W3C XML Schema Language validator needs a character based API to
> > correctly implement the minLength and maxLength facets on xsd:string
>
> As far as I understand, xsd:string is a list of "Character"-s, and a
> "Character" is an integer which can hold any valid Unicode code point.
N
You might want to look at East Asian Width http://unicode.org/reports/tr11/ for an approximation of
the green-screen width of a string.
To be absolutely precise, you need feedback from your green-screen layout engine and its font, of
course, like you do for a graphical display.
markus
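As a rough illustration of the East Asian Width approach mentioned above, the following Python sketch estimates how many fixed-width columns a string needs. The function name approx_columns and its rules (two columns for Wide and Fullwidth characters, zero for combining marks, one for everything else) are assumptions made for this example. As the message says, the terminal's layout engine and font have the final say.

```python
import unicodedata


def approx_columns(text: str) -> int:
    """Estimate the column count of text on a fixed-width display.

    Simplifying assumptions (not a complete terminal model):
    - combining marks occupy no column of their own,
    - East Asian Wide (W) and Fullwidth (F) characters occupy two columns,
    - every other character occupies one column.
    """
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn on top of the preceding base character.
            continue
        width += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return width
```

For example, approx_columns("abc") gives 3, while a string of three CJK ideographs gives 6 under these rules.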
Edward H
Marco Cimarosti wrote:
> As far as I understand, xsd:string is a list of "Character"-s, and a
> "Character" is an integer which can hold any valid Unicode code point.
Not quite. XML Schema points to XML for its definition of character, and
XML in turn says "A character is an atomic unit of text a
On 07/10/2003 08:42, Doug Ewell wrote:
...
The Book of Genesis would be an awfully thin "book" if it
appeared on the shelf individually. ...
Not that thin, actually - 85 pages in my Hebrew Bible. But some of the
"books", e.g. Obadiah and 2 and 3 John, fit easily on one page. So your
point sta
Elliotte Rusty Harold wrote:
> A W3C XML Schema Language validator needs a character based API to
> correctly implement the minLength and maxLength facets on xsd:string
As far as I understand, xsd:string is a list of "Character"-s, and a
"Character" is an integer which can hold any valid Unicode
> A W3C XML Schema Language validator needs a character based API to
> correctly implement the minLength and maxLength facets on xsd:string
> and types derived from it. Perhaps you would argue that the schema
> language should itself be written in terms of grapheme clusters
> rather than charac
Jill Ramonsky wrote... Well, one
thing she wrote was:
> :-)
OK, that's out of the way. What follows is not necessarily 100%
serious.
> I have invented a new system, Unilib, for organising books in a
> library.
>
> ... Except that you're not allowed to call them "books" any more,
> because I've
At 4:20 am -0700 7/10/03, Peter Kirk wrote:
Suppose I have a UTF-8 string and want to know
how many default grapheme clusters it contains.
How do I do so? Well, I step through the string
character by character, combining successive
characters into grapheme clusters. To do this
without having
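The stepping procedure described above can be sketched in Python as follows. This is only an approximation, not the full default grapheme cluster algorithm of UAX #29: it assumes that a cluster is a base character followed by any number of nonspacing or enclosing marks, and it ignores cases such as CR LF pairs and Hangul jamo sequences. The name approx_cluster_count is hypothetical.

```python
import unicodedata


def approx_cluster_count(utf8_bytes: bytes) -> int:
    """Approximate the number of default grapheme clusters in a UTF-8 string.

    Assumption: a cluster is one character plus any following characters
    whose general category is Mn (nonspacing mark) or Me (enclosing mark).
    The real rules in UAX #29 cover more cases than this.
    """
    text = utf8_bytes.decode("utf-8")
    count = 0
    for ch in text:
        if count > 0 and unicodedata.category(ch) in ("Mn", "Me"):
            # The mark extends the current cluster; do not start a new one.
            continue
        count += 1
    return count
```

Under these assumptions a Hebrew letter followed by two points counts as one cluster, even though it is three characters and six UTF-8 bytes.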
At 12:35 PM +0100 10/7/03, Jill Ramonsky wrote:
I have yet to see an APPLICATION which needs a character-based API.
Jill
A W3C XML Schema Language validator needs a character based API to
correctly implement the minLength and maxLength facets on xsd:string
and types derived from it. Perhaps you
On 07/10/2003 05:29, Marco Cimarosti wrote:
Peter Kirk wrote:
For i% = 1 to Len(utf8string$)
c$ = Mid(utf8string$, i%, 1)
Process c$
Next i%
Such a loop would be more efficient in UTF-32 of course, but this is
still a real need for working with character counts.
If the string type a
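For comparison with the Basic loop above, here is a hedged Python sketch of counting characters (code points) directly in UTF-8 bytes. It assumes the input is well-formed UTF-8 and relies on the fact that every continuation byte has the bit pattern 10xxxxxx, so counting the other bytes counts the code points. The function name count_code_points is an assumption for this example.

```python
def count_code_points(utf8_bytes: bytes) -> int:
    """Count code points in well-formed UTF-8 without decoding it.

    Continuation bytes have the form 10xxxxxx; every other byte starts
    a new code point, so counting those gives the code point count.
    """
    return sum(1 for b in utf8_bytes if (b & 0xC0) != 0x80)


# Four code points of 1, 2, 3 and 4 bytes respectively.
data = "a\u00e9\u65e5\U0001d11e".encode("utf-8")
for ch in data.decode("utf-8"):
    # Stands in for "Process c$" in the Basic loop.
    print(f"U+{ord(ch):04X}")
```

The count is a single pass over the bytes, so the cost of this operation in UTF-8 is linear rather than prohibitive.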
> No. What you have demonstrated below is that given an API
> based on characters, one can write an API based on
> default grapheme clusters. Nonetheless, it is only the
> resulting default-grapheme-cluster-based API which would
> actually be of any use to end-users.
How close to the "end" do
hat things are called (was Non-ascii string processing)
>
>
> Jill Ramonsky wrote:
> > Hey - the public will just have to get used to it!
>
> No, the public should not be bored with these technical details: in the user
> manual, a "book" will still be a "book
Peter Kirk wrote:
> For i% = 1 to Len(utf8string$)
> c$ = Mid(utf8string$, i%, 1)
> Process c$
> Next i%
>
> Such a loop would be more efficient in UTF-32 of course, but this is
> still a real need for working with character counts.
If the string type and function of this Basic dialect i
Jill Ramonsky wrote:
> Hey - the public will just have to get used to it!
No, the public should not be bored with these technical details: in the user
manual, a "book" will still be a "book". The fact that, in the source code
of the application "book" means something else is of interest only to
pr
On 07/10/2003 04:35, Jill Ramonsky wrote:
No. What you have demonstrated below is that given an API based on
characters, one can write an API based on default grapheme clusters.
Nonetheless, it is only the /_resulting
_/default-grapheme-cluster-based API which would actually be of any
use to e
> (2) The object currently called a "character" be renamed as something
> like "mapped codepoint" or "encoded codepoint", or possibly (coming in
> from the other end) something like "sub-character" or "character
> component" or "characterette" (which can be shortened to "charette" and
[EMAIL PROTECTED]]
> Sent: Tuesday, October 07, 2003 12:20 PM
> To: Jill Ramonsky
> Cc: [EMAIL PROTECTED]
> Subject: Re: Non-ascii string processing?
>
>
> On 07/10/2003 02:35, Jill Ramonsky wrote:
>
> >
> > Knowing the number of characters won't help you one iota
On 07/10/2003 02:35, Jill Ramonsky wrote:
Knowing the number of characters won't help you one iota. What you
need to know here is the number of default grapheme clusters.
I still have yet to hear a useful purpose for counting the number of
/characters/.
Jill
Suppose I have a UTF-8 string and w
I have invented a new system, Unilib, for organising books in a library.
... Except that you're not allowed to call them "books" any more,
because I've already redefined the word "book" to mean "the physical
expression of a catalogue entry". Since what the user normally
experiences as a book ma
Sigh! Things were a lot easier back in the old days of Unicode version
3, when default grapheme clusters were still called "glyphs". Okay, so
the general public still got it wrong, but that was just because they
were ignorant monkeys who didn't know any better, and it was up to the
likes of us
> Now - a count of DEFAULT GRAPHEME CLUSTERs might be useful (for example,
> for display on a console which uses fixed-width fonts). Indeed, a whole
> class of DEFAULT GRAPHEME CLUSTER handling functions might come in very
> handy indeed. Bytes are useful. Default grapheme clusters are useful.
ilto:[EMAIL PROTECTED]
> Sent: Monday, October 06, 2003 6:11 PM
> To: [EMAIL PROTECTED]
> Cc: Marco Cimarosti
> Subject: Re: Non-ascii string processing?
>
>
> Well, I know a good use for it: a console or terminal-based
> application which
> displays information using fixed-widt
'Doug Ewell'; Unicode Mailing
> List; Theodore
> H. Smith
> Subject: Re: Non-ascii string processing?
>
>
> Tell that to the editor (editors of paper publications still talk with
> this unit "3 000 characters, no more, for tomorrow morning").
Doug Ewell wrote:
> [...]
> > we'd all use UTF-336. Er?
>
> If only I had a bit more spare time, Jill. You do NOT want to get me
> started... >:-)
Go for it, Doug! :-)
If I only had a bit of spare time myself, I'd be eager to run
bits-per-character statistics for UTF:-)336 in various l
Jill Ramonsky wrote:
> But then, a default grapheme cluster might theoretically require up to
> 16 Unicode characters. (Maybe more, I don't know). Even bit-packed to
> 21 bits per character, that still gives us 336 bits. So I conclude
> that our string processing functions could go a lot faster i
On Monday 2003.10.06 21:36:13 +0200, Marco Cimarosti wrote:
> Edward H. Trager wrote:
> > > But I still don't see any use in knowing how many characters are in an UTF-8
> > > string, apart the use that I already mentioned: allocating a buffer for a
> > > UTF-8 to UTF-32 conversion.
> >
>
Edward H. Trager wrote:
> > But I still don't see any use in knowing how many characters are in an UTF-8
> > string, apart the use that I already mentioned: allocating a buffer for a
> > UTF-8 to UTF-32 conversion.
>
> Well, I know a good use for it: a console or terminal-based
> applicatio
Could you try that again with codepoints > U+ please? I'd be curious
to know what happens.
Jill
> -Original Message-
> From: John Delacour [mailto:[EMAIL PROTECTED]
> Sent: Monday, October 06, 2003 2:15 PM
> To: [EMAIL PROTECTED]
> Subject: RE: Non-ascii stri
'Doug Ewell'; Unicode Mailing List
> Cc: Theodore H. Smith
> Subject: RE: Non-ascii string processing?
>
>
> What strlen() cannot do is count the number of *characters* in a string.
> But who cares? I can imagine very few situations where such
> information would be useful.
>
> _ Marco
>
On Monday 2003.10.06 17:15:25 +0200, Marco Cimarosti wrote:
> Stephane Bortzmeyer wrote:
> > > OK. But the length in "characters" of a string is not "character semantics":
> > > it's plain nonsense, IMHO.
> >
> > I disagree.
>
> Feel free.
>
> But I still don't see any use in knowing how ma
> But I still don't see any use in knowing how many characters are in an UTF-8
> string, apart the use that I already mentioned: allocating a buffer for a
> UTF-8 to UTF-32 conversion.
I wouldn't use it for that at all. I'd assume a worst case of one 32-bit word in the
UTF-32 per octet in the UTF-8 o
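The worst-case sizing suggested above can be checked with a small Python sketch. It rests on one fact: a code point takes at least one octet in UTF-8 and exactly one 32-bit word in UTF-32, so one word per input octet is always enough. The function name utf32_worst_case_bytes is hypothetical.

```python
def utf32_worst_case_bytes(utf8_bytes: bytes) -> int:
    """Upper bound, in bytes, for a UTF-32 buffer holding the same text.

    Each code point uses at least one UTF-8 octet and exactly four bytes
    in UTF-32, so four bytes per input octet can never be exceeded.
    """
    return 4 * len(utf8_bytes)


samples = ["plain ascii", "\u00e9lite", "\u65e5\u672c\u8a9e", "\U0001d11e"]
for text in samples:
    utf8 = text.encode("utf-8")
    utf32 = text.encode("utf-32-le")  # the -le codec writes no byte order mark
    print(len(utf8), len(utf32), utf32_worst_case_bytes(utf8))
```

The bound is tight only for pure ASCII input; for other text it over-allocates, which is the trade-off against counting the characters first.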
Stephane Bortzmeyer wrote:
> > OK. But the length in "characters" of a string is not "character semantics":
> > it's plain nonsense, IMHO.
>
> I disagree.
Feel free.
But I still don't see any use in knowing how many characters are in an UTF-8
string, apart the use that I already mentioned: al
> > a word like "élite" is always counted as five characters, regardless
> > that it might be encoded as six Unicode "characters".
>
> I assume that everybody on this list knows that you count characters
> only after a proper normalization... (like many operations on Unicode
> texts).
A word li
At 12:09 pm +0200 6/10/03, Marco Cimarosti wrote:
What strlen() cannot do is count the number of *characters* in a string.
But who cares? I can imagine very few situations where such
information would be useful.
#!/usr/bin/perl
print "ab, \x{}\x{aaab}" ;
printf "\n%s, %s", le
On Mon, Oct 06, 2003 at 01:52:26PM +0200,
Marco Cimarosti <[EMAIL PROTECTED]> wrote
a message of 51 lines which said:
> a word like "élite" is always counted as five characters, regardless
> that it might be encoded as six Unicode "characters".
I assume that everybody on this list knows that y
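The "élite" example can be reproduced with Python's standard unicodedata module. This sketch assumes NFC is the normalization form meant here; the helper name count_after_nfc is hypothetical.

```python
import unicodedata


def count_after_nfc(text: str) -> int:
    """Count code points after normalizing the text to NFC."""
    return len(unicodedata.normalize("NFC", text))


# "e" followed by U+0301 COMBINING ACUTE ACCENT, then "lite".
decomposed = "e\u0301lite"

print(len(decomposed))              # code points as encoded
print(count_after_nfc(decomposed))  # code points after composition
```

Normalization only helps where a precomposed character exists. A base letter with a mark that has no precomposed form still counts as two characters after NFC.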
> > If you really aren't processing anything but the ASCII characters within
> > your strings, like "<" and ">" in your example, you can probably get
> > away with keeping your existing byte-oriented code. At least you won't
> > get false matches on the ASCII characters (this was a primary
On 06/10/2003 03:09, Marco Cimarosti wrote:
Doug Ewell wrote:
Depends on what "processing" you are talking about. Just to cite the
most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented
strlen() will fail dramatically.
Why? The purpose of strlen() is counting the number of
Stephane Bortzmeyer wrote:
> On Mon, Oct 06, 2003 at 12:09:34PM +0200,
> Marco Cimarosti <[EMAIL PROTECTED]> wrote
> a message of 14 lines which said:
>
> > What strlen() cannot do is count the number of *characters* in a string.
> > But who cares? I can imagine very few situations where
On Mon, Oct 06, 2003 at 12:09:34PM +0200,
Marco Cimarosti <[EMAIL PROTECTED]> wrote
a message of 14 lines which said:
> What strlen() cannot do is count the number of *characters* in a string.
> But who cares? I can imagine very few situations where such
> information would be use
Theodore H. Smith wrote:
> Hi lists,
Hi, member.
> I'm wondering how people tend to do their non-ascii string processing.
I think no one has been doing ASCII string processing for decades. :-) But I
guess you meant non-SBCS ("single byte character set") string processin
Doug Ewell wrote:
> Depends on what "processing" you are talking about. Just to cite the
> most obvious case, passing a non-ASCII, UTF-8 string to byte-oriented
> strlen() will fail dramatically.
Why? The purpose of strlen() is counting the number of *bytes* needed to
store a certain string, and
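The distinction drawn above, between the bytes needed to store a string and the characters it contains, can be shown in a few lines of Python. Here byte_length is a hypothetical stand-in for what a byte-oriented strlen() would report on UTF-8 data.

```python
def byte_length(text: str) -> int:
    """Number of bytes needed to store text as UTF-8 (no terminator)."""
    return len(text.encode("utf-8"))


word = "\u00e9lite"       # precomposed e-acute followed by "lite"
print(byte_length(word))  # storage size in bytes
print(len(word))          # number of code points
```

For a buffer allocation the byte count is the number that matters, which is the point being made about strlen().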
Theodore H. Smith wrote:
>> If you really aren't processing anything but the ASCII characters
>> within your strings, like "<" and ">" in your example, you can
>> probably get away with keeping your existing byte-oriented code.
>> At least you won't get false matches on the ASCII characters (this
Hi Doug,
Here are some things I think.
If you really aren't processing anything but the ASCII characters within
your strings, like "<" and ">" in your example, you can probably get
away with keeping your existing byte-oriented code. At least you won't
get false matches on the ASCII characters (thi
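The claim about false matches can be illustrated with a short Python sketch. In UTF-8, every byte of a multi-byte sequence has its high bit set, so a byte-oriented search for ASCII characters such as "<" and ">" cannot hit the middle of another character. The function name find_angle_brackets is an assumption for this example.

```python
from typing import List


def find_angle_brackets(utf8_bytes: bytes) -> List[int]:
    """Return the byte offsets of "<" (0x3C) and ">" (0x3E) in UTF-8 data.

    This is a plain byte scan. It is safe because bytes below 0x80 never
    occur inside a multi-byte UTF-8 sequence.
    """
    return [i for i, b in enumerate(utf8_bytes) if b in (0x3C, 0x3E)]


# "<", then a 2-byte character, then a 3-byte character, then ">".
data = "<\u00e9\u65e5>".encode("utf-8")
print(find_angle_brackets(data))
```

The offsets returned are byte offsets, not character offsets, which is fine as long as the surrounding code also works in bytes.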
Theodore H. Smith wrote:
> I'm wondering how people tend to do their non-ascii string processing.
>
> I'm wondering, if anyone really needs anything other than byte
> oriented code? I'm using UTF8 as my character format, and UTF8 is
> variable width, of course. I
Hi lists,
I'm wondering how people tend to do their non-ascii string processing.
I'm wondering, if anyone really needs anything other than byte oriented
code? I'm using UTF8 as my character format, and UTF8 is variable
width, of course. I offer the option of processing