Greek: Heta

2003-10-04 Nick Nicholas
Unicoders, in my extensive site on Greek Unicode issues, I discuss the 
representation of the old letter for /h/ in Greek inscriptions, at 
http://www.tlg.uci.edu/~opoudjis/unicode/unicode_aitch.html . As I say 
there, this is usually represented with a Latin h, as opposed to the 
breathing mark, but there is a minor tradition of representing it with 
a tack symbol, which is cased. The casing alone is, to my mind, grounds 
for proposing a lowercase and an uppercase heta, with the tack as the 
reference glyph and the Latin h as a glyph variant; but I don't know 
how extensively the tack heta is in use. 
Could people have a look at the page and comment?

===
 O Roeschen Roth! Der Mensch liegt in tiefster Noth! Der Mensch liegt in
 tiefster Pein!  Je lieber moecht'  ich im Himmel sein!   ---  _Urlicht_
http://www.opoudjis.net
Dr Nick NICHOLAS,  French & Italian,  Univ. of Melbourne, Australia



Re: Non-ascii string processing?

2003-10-04 Doug Ewell
Theodore H. Smith wrote:

> I'm wondering how people tend to do their non-ascii string processing.
>
> I'm wondering, if anyone really needs anything other than byte
> oriented code? I'm using UTF8 as my character format, and UTF8 is
> variable width, of course. I offer the option of processing UTF8, with
> byte functions, however.
>
> EG:
>
> Start = MyString.InStr( "<" )
> End = MyString.InStr( Start + 1, ">" )
>
> things like this, it really doesn't matter if your data is UTF8, you
> can still process it like bytes! Leading to faster speed, and simpler
> code.

If you really aren't processing anything but the ASCII characters within
your strings, like "<" and ">" in your example, you can probably get
away with keeping your existing byte-oriented code.  At least you won't
get false matches on the ASCII characters (this was a primary design
goal of UTF-8).
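
As a minimal sketch of why the byte-oriented search is safe for ASCII
delimiters, here is the same kind of search in Python (not the library
from the quoted post; the sample string and variable names are made up
for illustration). Every byte of a multi-byte UTF-8 sequence has its
high bit set, so a byte below 0x80 can only be a real ASCII character.

```python
# Hypothetical sample text mixing ASCII markup with non-ASCII content.
text = "caf\u00e9 <\u65e5\u672c\u8a9e> \U0001d11e"
data = text.encode("utf-8")

# Byte-oriented search for the ASCII delimiters, as in the quoted example.
start = data.find(b"<")
end = data.find(b">", start + 1)

# Both offsets sit on ASCII bytes, so the slice between them is
# guaranteed to be well-formed UTF-8 and can be decoded on its own.
inner = data[start + 1:end].decode("utf-8")

# Lead and continuation bytes of multi-byte sequences are all >= 0x80,
# so none of them can be mistaken for "<" (0x3C) or ">" (0x3E).
non_ascii_bytes = [b for ch in text if ord(ch) > 0x7F
                   for b in ch.encode("utf-8")]
high_bit_always_set = all(b >= 0x80 for b in non_ascii_bytes)
```

Note that `start` and `end` here are byte offsets, not character
offsets; that is fine as long as they are only used to slice the same
byte string.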

However, if your goal is to simplify processing of arbitrary UTF-8 text,
including non-ASCII characters, I haven't found a better way than to
read in the UTF-8, convert it on the fly to UTF-32, and THEN do your
processing on the fixed-width UTF-32.  That way you don't have to do one
thing for Basic Latin characters and something else for the rest.
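
A rough sketch of that convert-then-process approach, in Python, using
the standard "utf-32-le" codec; the helper names `utf8_to_utf32` and
`code_point_at` are invented for this illustration. Once every code
point occupies exactly four bytes, the Nth code point can be found by
simple arithmetic, whether or not it is ASCII.

```python
def utf8_to_utf32(data: bytes) -> bytes:
    """Decode UTF-8 and re-encode as little-endian UTF-32 (no BOM)."""
    return data.decode("utf-8").encode("utf-32-le")


def code_point_at(utf32: bytes, index: int) -> int:
    """Return the code point at a given position by direct indexing."""
    offset = 4 * index  # fixed width: every code point is 4 bytes
    return int.from_bytes(utf32[offset:offset + 4], "little")


# Four code points that take 1, 2, 3 and 4 bytes respectively in UTF-8.
utf8_data = "a\u00e5\u20ac\U0001d11e".encode("utf-8")
utf32_data = utf8_to_utf32(utf8_data)

count = len(utf32_data) // 4
second = code_point_at(utf32_data, 1)   # U+00E5
fourth = code_point_at(utf32_data, 3)   # U+1D11E
```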

You will probably hear from some very prominent Unicode people that
converting to UTF-16 is better, because "most" characters are in the
BMP, for which UTF-16 uses half as much memory.  But this approach
doesn't really solve the variable-width problem -- it merely moves it,
from "ASCII vs. non-ASCII" to "BMP vs. non-BMP."  Unless you are keeping
large amounts of text in memory, or are working with a small device such
as a handheld, the extra size of UTF-32 compared to UTF-16 is unlikely
to be a big problem, and you have the advantage of dealing with a
fixed-width representation for the entire Unicode code space.
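
To make the "it merely moves it" point concrete, the following Python
sketch counts the code units each encoding form needs for a few sample
characters. The sample characters and the helper name `code_units` are
chosen only for illustration.

```python
def code_units(ch: str, encoding: str, unit_size: int) -> int:
    """Number of code units of the given byte size needed for one character."""
    return len(ch.encode(encoding)) // unit_size


# ASCII, Latin-1 range, another BMP character, and a non-BMP character.
samples = ["a", "\u00e5", "\u20ac", "\U0001d11e"]

# Maps each character to (UTF-8 units, UTF-16 units, UTF-32 units).
widths = {
    ch: (
        code_units(ch, "utf-8", 1),
        code_units(ch, "utf-16-le", 2),
        code_units(ch, "utf-32-le", 4),
    )
    for ch in samples
}

for ch, (u8, u16, u32) in widths.items():
    print(f"U+{ord(ch):04X}: UTF-8={u8} UTF-16={u16} UTF-32={u32}")
```

UTF-8 varies from one to four units across these samples, UTF-16 is
one unit until the non-BMP character needs a surrogate pair, and
UTF-32 is always one unit.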

All of this assumes that you don't have multi-character processing
issues, like combining characters and normalization, or culturally
appropriate sorting, in which case your character processing WILL be
more complex than ASCII no matter which CES you use.
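
As one small example of such a multi-character issue, here is a Python
sketch using the standard `unicodedata` module. The helper name
`canonically_equal` is made up for this illustration. The same visible
letter can be one code point or two, so a fixed-width encoding alone
does not make a plain comparison correct.

```python
import unicodedata


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00e5"    # LATIN SMALL LETTER A WITH RING ABOVE
decomposed = "a\u030a"    # "a" followed by COMBINING RING ABOVE

# Code point by code point, the two spellings differ...
same_code_points = precomposed == decomposed
# ...but after normalization they are the same text.
same_text = canonically_equal(precomposed, decomposed)
```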

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Non-ascii string processing?

2003-10-04 Theodore H. Smith
Hi lists,

I'm wondering how people tend to do their non-ascii string processing.

I'm wondering, if anyone really needs anything other than byte oriented 
code? I'm using UTF8 as my character format, and UTF8 is variable 
width, of course. I offer the option of processing UTF8, with byte 
functions, however.

EG:

Start = MyString.InStr( "<" )
End = MyString.InStr( Start + 1, ">" )

things like this, it really doesn't matter if your data is UTF8, you 
can still process it like bytes! Leading to faster speed, and simpler 
code.

So, I'm wondering, in fact, is there ANY code that needs explicit UTF8 
processing? Here are a few I've thought of.

1) Spell checking - needs UTF8 character-based iteration
2) Lexical processing - needs UTF8 mode to be able to match "å" to "a".

Can anyone tell me any more? Please feel free to go into great detail 
in your answers. The more detail the better.
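
To illustrate case 2, here is one possible way to match "å" to "a",
sketched in Python with the standard `unicodedata` module rather than
with the string library discussed here. The function name
`strip_marks` is made up for this example, and stripping marks is only
one approach; in some languages "å" is a distinct letter, so this kind
of folding is not always appropriate.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose to NFD, then drop combining marks such as accents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = "\u00c5ngstr\u00f6m"          # "Ångström"
folded = strip_marks(word)           # accents removed

# Byte-level and character-level views of the same word differ in length,
# which is why this kind of matching cannot be done on raw bytes.
byte_length = len(word.encode("utf-8"))
char_length = len(word)
```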

Thanks a lot!

I'm just wondering whether I can simplify my string processing library, 
and whether anyone really needs anything beyond byte-level processing 
for most functions, apart from a few for the two cases I mentioned above!