Re: Nicest UTF

2004-12-03 Thread Theo
From: Asmus Freytag [EMAIL PROTECTED]

I use ... and UTF-32 for most internal processing that I write myself.
Let people say UTF-32 is wasteful if they want; I don't tend to store
huge amounts of text in memory at once, so the overhead is much less
important than one code unit per character.

For performance-critical applications on the other hand, you need to use
whichever UTF gives you the correct balance in speed and average storage
size for your data.

If you have very large amounts of data, you'll be sensitive to cache
overruns. Enough so, that UTF-32 may be disqualified from the start.
I have encountered systems for which that was true.
For both of these, I'd recommend UTF-8. It's compact (especially when
parsing source code, which is mostly ASCII even if it contains other
languages) and it's fast to process. Just use the byte processing
functions.

I've done natural-language word processing functions on UTF-8 also, and
it's damn fast even there, even though the processing is case-insensitive!

My test was to do word counting, to see word frequencies. I did this
with UTF-8. All strings were entered as UTF-8 into a special scanner
which I invented, as both uppercase and lowercase, so the scanner held
both uppercase and lowercase UTF-8 variants of each word.

The scanner itself, however, only does byte (case-sensitive) searching.
So, despite the matching being case-insensitive UTF-8, it was totally
blastingly fast. (One person reported counting words at 1 MB/second of
pure text, from within a mixed Basic / C environment.) Keep in mind that
on every single word lookup the counter must search through thousands of
words (every single word it has come across in the text so far).
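
A minimal sketch of the byte-level idea (not the scanner described above,
whose internals aren't shown here; the function name and details are my own):
because UTF-8 lead and continuation bytes never fall in the ASCII range, a
word scanner that only tests for ASCII whitespace works unchanged on UTF-8.

#include <ctype.h>
#include <stddef.h>

/* Count words in a UTF-8 buffer by scanning raw bytes.  Word boundaries are
 * detected on ASCII whitespace only; multi-byte UTF-8 sequences (lead bytes
 * 0xC0.., continuation bytes 0x80..0xBF) never match ASCII whitespace, so
 * they are simply treated as word bytes. */
static size_t count_words_utf8(const unsigned char *s, size_t len)
{
    size_t count = 0;
    int in_word = 0;

    for (size_t i = 0; i < len; i++) {
        if (s[i] < 0x80 && isspace(s[i])) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            count++;
        }
    }
    return count;
}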

Anyhow, from my experience, UTF-8 is great for speed and RAM.



Re: Public Review Issues Update

2004-10-21 Thread Theo Veenker
Mark Davis wrote:
All comments are reviewed at the next UTC meeting. Due to the volume, we
don't reply to each and every one what the disposition was. If actions were
taken, they are recorded in the minutes of the meetings.
But what if an action was not taken? Do I have to keep reporting a particular
problem until it's gone? Also, there is no way of telling which problems
have already been reported a dozen times before. Assuming the comments
reported are archived, why can't this archive be made accessible to the
unicode list?
Theo
P.S.
I know the Unicode Consortium is a non-profit organization and you are all
very, very busy, and I look up to the people behind it. But I often find
it hard to see the part where the Unicode Consortium is actually promoting
use of the Unicode Standard. Sometimes it almost seems to discourage
developers from using the Standard. Take the book, for instance: one is not
allowed to print the online version, so I bought the book, only to find out
that 2 of its 3 kilograms are just tables of glyphs which could have been
on a CD. So you pay $75 to get value for $25.



Re: Public Review Issues Update

2004-10-20 Thread Theo Veenker
Rick McGowan wrote:
If you have comments for official UTC consideration, please post them by
submitting your comments through our feedback & reporting page:
http://www.unicode.org/reporting.html
Hi Rick,
Please tell us, why does one never get feedback on submitted comments regarding
Public Review Issues? The UTC is seeking public feedback, but apparently it's not
considered necessary to give responders feedback on their comments. There isn't
even an accessible list of reported 'problems' or comments (besides the errata
list). Why should I even bother responding if I can't tell whether it's all in vain?
Best regards,
Theo



UAX 15 hangul composition

2004-08-03 Thread Theo Veenker
Don't know if this has been asked/reported before, but is the example code
for hangul composition in UAX 15 correct?
The code is:
public static String composeHangul(String source) {
    int len = source.length();
    if (len == 0) return "";
    StringBuffer result = new StringBuffer();
    char last = source.charAt(0);            // copy first char
    result.append(last);

    for (int i = 1; i < len; ++i) {
        char ch = source.charAt(i);

        // 1. check to see if two current characters are L and V
        int LIndex = last - LBase;
        if (0 <= LIndex && LIndex < LCount) {
            int VIndex = ch - VBase;
            if (0 <= VIndex && VIndex < VCount) {
                // make syllable of form LV
                last = (char)(SBase + (LIndex * VCount + VIndex) * TCount);
                result.setCharAt(result.length()-1, last); // reset last
                continue; // discard ch
            }
        }

        // 2. check to see if two current characters are LV and T
        int SIndex = last - SBase;
        if (0 <= SIndex && SIndex < SCount && (SIndex % TCount) == 0) {
            int TIndex = ch - TBase;
            if (0 <= TIndex && TIndex <= TCount) {
                // make syllable of form LVT
                last += TIndex;
                result.setCharAt(result.length()-1, last); // reset last
                continue; // discard ch
            }
        }

        // if neither case was true, just add the character
        last = ch;
        result.append(ch);
    }
    return result.toString();
}
Suppose I feed it 0xAC00 0x11C3. 0xAC00 is an LV.
This will do step 2:
SIndex = 0xAC00 - 0xAC00 = 0
TIndex = 0x11C3 - 0x11A7 = 28
which causes the (0 <= TIndex && TIndex <= TCount) test to be true.
And the resulting output is 0xAC00 + 28 = 0xAC1C, which is not
an LVT but an LV syllable!
The TIndex <= TCount should be TIndex < TCount, I think. IMO the
example would be clearer if the Hangul_Syllable_Type property
were used.
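
For what it's worth, a small check of the arithmetic above (a sketch in C
rather than the Java of the UAX example; constant names follow the usual
SBase/TCount convention):

#include <stdio.h>

#define SBase  0xAC00
#define LCount 19
#define VCount 21
#define TCount 28
#define SCount (LCount * VCount * TCount)   /* 11172 precomposed syllables */

/* Returns 1 for a precomposed LV syllable (no trailing consonant),
 * 2 for an LVT syllable, 0 for anything that is not a precomposed syllable. */
static int hangul_syllable_kind(unsigned cp)
{
    if (cp < SBase || cp >= SBase + SCount) return 0;
    return ((cp - SBase) % TCount == 0) ? 1 : 2;
}

int main(void)
{
    /* 0xAC00 + 28 = 0xAC1C is the start of the next LV block, not an LVT
     * form of 0xAC00, which is exactly the problem described above. */
    printf("U+AC1C kind = %d\n", hangul_syllable_kind(0xAC1C));  /* prints 1 */
    return 0;
}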
A somewhat related question. I know next to nothing about Hangul
[de]composition so forgive me for asking silly questions. In the
UnicodeData.txt file there are many more than the 19 L, 21 V, and
28 T jamos. Are the other jamos not used to compose syllables, or
does the syllable block represent an incomplete set of compatibility
characters? Which is it?
Theo



2nd attempt: final_sigma vs final_cased

2004-06-21 Thread Theo Veenker
Hi,
Is there somebody out there who can answer this question?
Casing context Final_Sigma is being used in SpecialCasing.txt, but its
specification is no longer present in the standard (at least I can't
find it). Obviously this context is now called Final_Cased, but the
specification for Final_Cased (section 3.13) is not identical to that
of Final_Sigma (UAX 21, superseded).
regexp Final_Cased:
  Before C: [{cased=true}] [{wordBoundary!=true}]*
  After C:  !([{wordBoundary!=true}]* [{cased=true}])
regexp Final_Sigma:
  Before C: cased case-ignorable*
  After C:  !(case-ignorable* cased)
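
To make the comparison concrete, here is how I read the old Final_Sigma test
in code (a sketch only; the is_cased/is_case_ignorable predicates are assumed
to come from some property lookup, and this is not meant as a normative
definition):

#include <stddef.h>

/* Assumed property predicates over code points, from whatever UCD lookup
 * is available; both names are placeholders. */
extern int is_cased(unsigned cp);
extern int is_case_ignorable(unsigned cp);

/* Final_Sigma context for the code point at index pos in text[0..len):
 * preceded by a cased character (skipping case-ignorables) and not
 * followed by a cased character (again skipping case-ignorables). */
static int in_final_sigma_context(const unsigned *text, size_t len, size_t pos)
{
    size_t i;
    int cased_before = 0;

    for (i = pos; i > 0; i--) {
        unsigned cp = text[i - 1];
        if (is_case_ignorable(cp)) continue;
        cased_before = is_cased(cp);
        break;
    }
    if (!cased_before) return 0;

    for (i = pos + 1; i < len; i++) {
        unsigned cp = text[i];
        if (is_case_ignorable(cp)) continue;
        return !is_cased(cp);
    }
    return 1;   /* end of text: nothing cased follows */
}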
Is the old specification of Final_Sigma still valid for determining
the final sigma casing context, or are there situations where it
is inadequate? What I mean is: are these specifications actually
the same WRT final sigma?
Theo
PS. I sometimes have the feeling that I'm on the wrong list here.
Most discussions are about characters, almost never about
implementation issues. Is there a unicode developers list
perhaps that I'm unaware of?



final_sigma vs final_cased

2004-06-14 Thread Theo Veenker
Hi,
Casing context Final_Sigma is being used in SpecialCasing.txt, but its
specification is no longer present in the standard (at least I can't
find it). Obviously this context is now called Final_Cased, but the
specification for Final_Cased (section 3.13) is not identical to that
of Final_Sigma (UAX 21, superseded).
regexp Final_Cased:
  Before C: [{cased=true}] [{wordBoundary!=true}]*
  After C:  !([{wordBoundary!=true}]* [{cased=true}])
regexp Final_Sigma:
  Before C: cased case-ignorable*
  After C:  !(case-ignorable* cased)
Is the old specification of Final_Sigma still valid for determining
the final sigma casing context, or are there situations where it
is inadequate? What I mean is: are these specifications actually
the same WRT final sigma?
Theo



base character

2004-06-10 Thread Theo Veenker
According to the definition a base character is:
  A character that does not graphically combine with
  preceding characters, and that is neither a control
  nor a format character.
How is this expressed in terms of properties?
Something like this? ccc==0 AND GC!=Cc AND GC!=Cf AND GC!=Cn
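
Expressed as code, the candidate test above would look roughly like this
(the property-lookup helpers are hypothetical; whether this is the intended
derivation is exactly my question):

/* Hypothetical property-lookup helpers, backed by whatever UCD tables
 * the application has loaded. */
typedef enum { GC_Cc, GC_Cf, GC_Cn, GC_Other /* remaining categories omitted */ } gc_t;
extern gc_t general_category(unsigned cp);
extern int  combining_class(unsigned cp);    /* canonical combining class */

/* Candidate reading of the definition: ccc == 0 and the general
 * category is not Cc, Cf or Cn. */
static int is_base_character(unsigned cp)
{
    gc_t gc = general_category(cp);
    return combining_class(cp) == 0
        && gc != GC_Cc && gc != GC_Cf && gc != GC_Cn;
}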
Theo



A binary file format for storing character properties

2004-05-04 Thread Theo Veenker
At this time there are about 160 different character properties defined
in the UCD. In practice most applications probably only use a limited set
of properties to work with. Nevertheless, applications should be able to
look up all the properties of a code point. Compiling in lookup tables for
all defined properties (including Unihan) makes small applications become
rather big. This made me decide to create a binary file format for storing
character properties and to initialize property lookup tables on demand.
Benefits of using run-time loadable lookup tables initialized from binary
files are:
  - no worries about total table size, since data will only be loaded
on demand
  - initializing lookup tables from a binary file is relatively fast
  - property lookup files can be locale specific (useful for character
names and case mappings for example)
  - new properties can be added quickly and never affect layout or
content of other tables
  - any number of properties can be supported including custom
(non-Unicode) properties
  - by initializing a lookup table from two sources (UCD-based and
vendor-based), applications can overload the default property
values assigned to PUA characters with private property values
The file format I've implemented is capable of storing any type of property.
Each file contains property values for one property (no more squeezing as
many property values as possible into as few bits as possible). The format
is called UPR (Unicode PRoperties).
I have written a tool to generate the necessary UPR files from the UCD. A
small C-library for reading a UPR file into a property lookup table, and
a high-level library which provides property lookup functions for *all*
Unicode properties in 4.0.0 are also available.
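
To illustrate the on-demand idea, a consumer could hold one lazily
initialized table per property, along these lines (the names and file name
below are made up for the example and are not the actual UPR API):

#include <stddef.h>

/* Illustrative only -- not the real UPR interface. */
typedef struct property_table property_table;
extern property_table *load_property_file(const char *path);  /* read one UPR file */
extern int             property_lookup(const property_table *t, unsigned cp);

/* The table for a given property is loaded the first time it is needed,
 * so properties the application never touches cost nothing. */
static int lookup_line_break(unsigned cp)
{
    static property_table *tbl = NULL;
    if (tbl == NULL)
        tbl = load_property_file("LineBreak.upr");   /* hypothetical file name */
    return property_lookup(tbl, cp);
}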
For more information on the file format and related software see:
http://www.let.uu.nl/~Theo.Veenker/personal/projects/upr/. My primary
development platform is UNIX/Linux, but you can compile and run it under
Windows as well (less tested, however). The current version supports UCD
4.0.0; I will add support for 4.0.1 soon.
Please check it out. Feedback is welcome.
Regards,
Theo Veenker



Re: Suggestion: use of symbolic links in the FTP site

2004-04-22 Thread Theo Veenker
Tom Emerson wrote:
Philippe Verdy writes:

Symbolic links is a bad idea on FTP. They are resolved by the client...


Really? Depends on your server: proftpd handles them fine.


I think it would also give the false feeling that a new 4.01 file
exist when in fact it's the same as 4.00.


No, the filename of the link would match that of the file in the 4.00
directory. The link only provides you with a way to access the
datafile. I would not recommend renaming the link to 4.01 if it didn't change.

I better suggest listing files with a simple parsable file which
lists all files that are part of a Unicode version, with their
respective version.


The problem here is that you cannot easily grab all the files for a
given release without first downloading that file and parsing the
information out and constructing pathnames.
It would not be difficult to write a script which automatically downloads
the required files (using wget) for a given Unicode version if there were
such a file as suggested by Philippe.
Theo




Re: Downloading UCD 4.0.0

2004-04-20 Thread Theo Veenker
Asmus Freytag wrote:
At 08:42 AM 4/19/2004, Theo Veenker wrote:

Hi,

Until now I always downloaded the latest version of the UCD
and worked with that. Now I want to download the UCD files for
4.0.0 again. I know it is all in http://www.unicode.org/Public/4.0-Update/,
but in http://www.unicode.org/ucd/ I read this:
  The complete set of all files for a given version of the UCD
   consists of the files in the update directory for that version,
   together with all the files unchanged from earlier versions,
   which are kept in their respective update directories.
Do I really need to find out and download all unchanged files
from 3.2.0 and earlier, just to get the files for 4.0.0?


Yes.
:-(

And depending on what version of the UCD you are trying to piece 
together you may need potentially versions of some files from several 
earlier updates.

A./

PS: we are looking into ways to make access to older versions
more straightforward.
Please do! In the current situation if you have a UCD parser which
only handles a particular Unicode version, the set of required input
files for this parser is more or less taken away when a new Unicode
version is released. This makes it difficult to share such software.
It is good to gently push developers to use the latest release, but
making it hard to use an earlier release is IMO counter-productive.
Theo




Downloading UCD 4.0.0

2004-04-19 Thread Theo Veenker
Hi,

Until now I always downloaded the latest version of the UCD
and worked with that. Now I want to download the UCD files for
4.0.0 again. I know it is all in http://www.unicode.org/Public/4.0-Update/,
but in http://www.unicode.org/ucd/ I read this:
  The complete set of all files for a given version of the UCD
   consists of the files in the update directory for that version,
   together with all the files unchanged from earlier versions,
   which are kept in their respective update directories.
Do I really need to find out and download all unchanged files
from 3.2.0 and earlier, just to get the files for 4.0.0?
Theo




Re: ISO 8859-11 (Thai) cross-mapping table

2002-10-09 Thread Theo Veenker

Marco Cimarosti wrote:
 
 John Aurelio Cowan wrote:
  Marco Cimarosti scripsit:
   Talking about the format of mapping tables, I always
   wondered why not using ranges. In the case of ISO
   8859-11, the table would become as compact as
   three lines:
 
  Well, that wins for 8859-1 and 8859-11 and ISCII-88, where Unicode
  copied existing layouts precisely.  But it wouldn't help other 8859-x
  much if at all,
 
 All 8859 tables would be more succinct.
 
 Non-Latin sections use contiguous ranges of letters in alphabetical order
 or, at any rate, in the same order used by Unicode; this is also true for most
 other non-ISO charsets.
 
 Latin sections are a worse case, but they still benefit slightly, because
 characters shared with Latin-1 stay in the same positions.
 
  and it requires binary search rather than direct
  array access, which would be a terrible lossage in CJK, where the
  real costs are.
 
 I agree. In the case of CJK it simply doesn't pay.

If I may add my two cents: IMO using search algorithms to reduce table size
doesn't pay in any case. If one uses fast one/two-stage lookup tables for
both mappings (legacy to Unicode and vice versa), then most tables require
about 3 KB or less of storage space, roughly 10 to 30 times that for CJK
encodings. Compared to the 256 MB in a typical PC, each lookup table would
consume about 0.001% (or 0.01-0.03% for CJK) of main memory. My point is that
it is better to concentrate on processing speed than on table footprint.
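
To put some flesh on that, a sketch of the direct-lookup approach for an
8-bit charset (table names are mine): the forward map is a single 256-entry
array, and the reverse map is a small two-stage table indexed by the high and
low byte of the BMP code point, with identical blocks shared so the total
stays in the few-KB range.

#include <stdint.h>

/* Forward map: legacy byte -> BMP code point, filled from the mapping table. */
extern uint16_t to_unicode[256];

/* Reverse map, two stages over the BMP: stage1 selects a 256-entry block of
 * legacy bytes (NULL for rows with no mappings, 0 for unmappable entries).
 * Most rows share the same empty block, which keeps the table small. */
extern const uint8_t *stage1[256];          /* indexed by code point >> 8 */

static uint16_t legacy_to_unicode(uint8_t b)
{
    return to_unicode[b];
}

static uint8_t unicode_to_legacy(uint16_t cp)
{
    const uint8_t *block = stage1[cp >> 8];
    return block ? block[cp & 0xFF] : 0;    /* 0 means "no mapping" here */
}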

Theo




\p{} and \g{} in regexp

2002-07-23 Thread Theo Veenker

Hi,

I have a few questions regarding unicode regular expressions.

1)  I'm working on a regexp matcher and I'd like to know which properties 
are never needed in a \p{...} item. Currently I have included the properties
listed below, but for efficiency reasons I'd like to throw out what isn't 
really necessary:

  general category
  bidi class?
  canonical combining class ?
  decomposition type
  line break
  east asian width
  arabic joining type   ?
  arabic joining group  ?
  script name
  block name
  age
  numeric type
  all binary properties

So can anyone tell me if the marked properties are really useful in 
a \p{...} item?


2)  About grapheme clusters in a bracketed expression. It is clear what is
meant by an expression like [a-z\g{aa}]. But how do I interpret something
like [a-z\g{aa} && \p{foo}]? This reads as: accept any character in range 
a-z or grapheme cluster aa, provided it has the foo property. The problem
is that \p{...} only applies to single code points, not to grapheme clusters.

I can do three things:
  1. see if NFC of the characters in \g{...} yields a single character and
     work with that, otherwise fail
  2. only test the first (base) character of the cluster
  3. don't allow use of the operators && and - (i.e. ^) in a bracketed 
     expression in which one or more \g{...} are used

What would be the most appropriate thing to do?

Regards,
Theo




Re: unidata is big

2002-04-24 Thread Theo Veenker

andreas palsson wrote:
 
 Hi.
 
 I would just like to know if someone could give me a tip on how to
 structure all the unicode-information in memory?
 
 All the UNIDATA does contain quite a bit of information and I can't see
 any obvious method which is memory-efficient and gives fast access.

You might want to evaluate some of the open source libraries 
mentioned under Enabled Products on the Unicode site. For my
own lib (http://www.let.uu.nl/~Theo.Veenker/personal/projects/ucp/)
I've created a separate table builder tool for each property or 
mapping. The tools organize data in planes, and for each plane
all possible trie setups are determined (about 80 combinations
of one-, two- or three-stage tables). Then the cheapest setup
is used. This still requires over 230 KB to store all data 
(except character names and comments) from the following files:
UnicodeData.txt, EastAsianWidth.txt, LineBreak.txt, ArabicShaping.txt,
Scripts.txt, Blocks.txt, SpecialCasing.txt, CaseFolding.txt,
BidiMirroring.txt, PropList.txt, DerivedCoreProperties.txt,
DerivedNormalizationProperties.txt, and DerivedJoiningType.txt.
For some mappings I've stored 32-bit code points where 16 bits
would have been enough, but I decided API uniformity is more
important than memory efficiency. 
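
As a rough illustration of the multi-stage idea (sizes and the 8-bit split
are arbitrary here; the real builder tries all setups per plane and keeps the
cheapest): blocks of property values that happen to be identical are stored
only once and shared through an index table.

#include <stdint.h>

/* Two-stage lookup over one plane (0x10000 code points), split into 256
 * blocks of 256 values. index[] gives the block number for the high byte;
 * identical blocks appear only once in values[], which is where the
 * space savings come from. */
typedef struct {
    uint16_t index[256];        /* block number, indexed by (cp >> 8) & 0xFF */
    const uint8_t *values;      /* concatenated 256-entry value blocks       */
} plane_trie;

static uint8_t trie_get(const plane_trie *t, unsigned cp_in_plane)
{
    unsigned block = t->index[(cp_in_plane >> 8) & 0xFF];
    return t->values[block * 256u + (cp_in_plane & 0xFF)];
}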

I wouldn't bother too much about memory efficiency; it's irrelevant
these days. Even your mobile phone has enough memory to store all the
Unicode data 10 to 20 times over. Same thing for lookup speed. All you
have to do to get it fast is to wait (a few seasons).

Theo




Re: Whence UniData.txt? (was Re: unidata is big)

2002-04-24 Thread Theo Veenker

[EMAIL PROTECTED] wrote:
 
 Theo's comment leads me to a question I've pondered recently:
 
 Assumptions:
 
Many apps, from independent sources, need to access the Unicode
character data,
 
A lot of these apps aren't overly concerned with the slight overhead of
parsing the data as needed from Unicode-supplied data files directly.
 
Similarly, such apps benefit from being able to easily upgrade to new
Unicode releases by simply replacing the data files.
 
   It isn't very user-friendly for every such app to store its own
private copy of the character data files when a single shared copy would
take up less space and be easier to maintain.
 
 It would seem to me that there is some value in establishing either (1) a
 standard location where programs can expect to find (or install) a local
 copy of the Unicode data files, or (2) a standard way to discover where
 such a local copy of these files exist. My preference would be (2), which
 would make it easy to configure a network of machines to share a single
 copy of the data files. Something as simple as an environment variable
 could work if developers were to agree on its name and semantics.

For applications that eat raw UCD files, this shouldn't be too difficult
to achieve. Any well-designed app will/should have some parameter or env.
variable that you can set (no?). But for apps/libraries that like their UCD 
files cooked it is a different story, because there is no recommended binary 
format for representing (compact) Unicode character data. Personally I
would appreciate seeing such a recommendation, including your point (2).
However, apps/libs which enrich the character data with custom properties
would still need their own copy of the data.
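
On the application side it could be as simple as the usual parameter /
environment-variable / default chain; the variable name here is purely
hypothetical, since agreeing on the name is the whole point:

#include <stdlib.h>

/* Resolve the directory holding the UCD files: explicit parameter first,
 * then an agreed-upon environment variable (name invented for this sketch),
 * then some built-in default. */
static const char *ucd_directory(const char *override)
{
    const char *dir = override;
    if (dir == NULL) dir = getenv("UNICODE_DATA");   /* hypothetical name */
    if (dir == NULL) dir = "/usr/share/unicode";     /* arbitrary default */
    return dir;
}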

The subject reminds me of the TZ database. Here you have a large text-based 
database containing information on time zones and daylight saving times.
You can compile the data into a binary format by running a utility included
with the tz sources. Well, they don't give any recommendation on where to 
store the (text and/or binary) data, but at least there is a 'standard' 
format, which allows for sharing data. It would be nice to have something
like this for the UCD.

 (I understand there may be different mechanisms for different platforms,
 but it would be even better if a standard mechanism were cross platform).
 
 So, are there any conventions for this evolving?  Or would anyone like to
 propose one?

Please, go ahead :o)

Theo




grapheme length

2002-04-18 Thread Theo Veenker

Hi,

I'd like to know if there is something like a longest
grapheme length. From UTR #18 I see there is no limit,
but in practice, can someone give a rough estimate of how many
code points the longest grapheme would occupy? 

Theo




UCD 3.2.0

2002-04-04 Thread Theo Veenker

Hi all,

I'd like to make a few remarks about the UCD files.

The following things I ran into when checking out the 3.2.0 release:

 o  In PropertyValueAliases-3.2.0.txt line 79:
ccc; 202; ATBL ; Attached_Below_Left
whereas in UnicodeData-3.2.0.html I read:
200: Below left attached
202: Below attached
What is the correct value for attached below left, 200 or 202?

 o  In SpecialCasing-3.2.0.txt lines 234 and 235 are missing the closing
semicolon. This problem also appeared in 3.1.1.

 o  Typo in UnicodeCharacterDatabase-3.2.0.html:
DerivedNormalizationProperties, should be DerivedNormalizationProps.

Minor points that I find a bit annoying:

 o  Many of the UCD files have a comment header with lines longer than 80
    characters. Viewing these files using the page utility in an 80-column
    terminal window gives ugly output due to the forced line wrapping.

 o  All UCD files except CaseFolding-3.2.0.txt and SpecialCasing-3.2.0.txt
*separate* columns by semicolons. For the two exceptions the semicolon 
*terminates* a column, why not keep it the same for all UCD files?

 o  UnicodeData-3.2.0.txt still uses this notation:
    1234;<Blah, First>;Lo;0;L;N;
    5678;<Blah, Last>;Lo;0;L;N;
    instead of 
    1234..5678;<Blah, First..Blah, Last>;Lo;0;L;N;
    Since all other UCD files use the latter notation, why not change this
    one too? IMHO backward compatibility with existing UCD file parsers 
    shouldn't be an issue in this particular case.
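
For what it's worth, handling the current notation means the parser has to
carry state between the First and Last lines, roughly like this
(assign_properties is a hypothetical hook for whatever the parser does with
a record):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical hook: store the semicolon-separated fields for one code point. */
extern void assign_properties(long cp, const char *fields);

/* Reading UnicodeData.txt: a name field ending in ", First>" opens a range
 * that is closed by the matching ", Last>" line, so those lines cannot be
 * treated independently of each other. */
void read_unicode_data(FILE *fp)
{
    char line[1024];
    long range_first = -1;

    while (fgets(line, sizeof line, fp)) {
        long cp = strtol(line, NULL, 16);
        const char *name = strchr(line, ';');
        if (name == NULL) continue;
        name++;

        if (strstr(name, ", First>")) {
            range_first = cp;                         /* remember range start */
        } else if (strstr(name, ", Last>") && range_first >= 0) {
            for (long c = range_first; c <= cp; c++)  /* same fields for range */
                assign_properties(c, line);
            range_first = -1;
        } else {
            assign_properties(cp, line);              /* ordinary single entry */
        }
    }
}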

Regards,
Theo




mnemonic input

2002-03-27 Thread Theo Veenker

Hi all,

Suppose I want to enable mnemonic input in my software. Using 
mnemonics allows one to write e' (of course embedded in some 
escape sequence) instead of \u00e9 or &eacute;. Which sets of 
mnemonics are being used, or which should I use? I found the ISO-10646 
charmap file which gives mnemonic.ds symbol names, but I can 
find no further references. I suppose only a minimal set of 
mnemonics is really useful, because to me it seems almost 
impossible to remember them all, even though there's logic 
behind the construction of mnemonics. 

My questions are: 
- Which mnemonic sets are available and actually used by people?
- Are there any recommended escape sequences to enter a mnemonic
  (like &name; does for SGML character entity names)?
- Does it make sense to use mnemonics for ideographic scripts? 
- Is support for input of full Unicode character names (as I
  believe is supported in Perl) considered useful?

Thanks,

Theo




Re: mnemonic input

2002-03-27 Thread Theo Veenker

Marco Cimarosti wrote:
 
 Ooops!
 
 Of course, I was replying to a different question: Does it make sense to
 use mnemonics for ideographic scripts?

I hadn't even noticed you quoted the wrong question, but I understood
it anyway. What I meant was: I can use mnemonic characters in a plain
ASCII text file (instead of, or in addition to, writing \u escapes), but
could I also include Japanese this way (assuming I knew any Japanese),
without using an input method to compose the characters? From RFC 1345,
referred to by Keld Simonsen, I see I can (at least for most BMP stuff).

Many thanks for the information.

Regards,
Theo




UCD 3.2.0

2002-03-11 Thread Theo Veenker

Dear list,

I don't know if this is the right place to ask this question, but 
I'll ask it anyway. I'm updating my software for the UCD 3.2.0 (beta)
files, but in file ArabicShaping-5d2.txt at line 202 I read

0718; WAW; R; SYRIAC WAW

where SYRIAC WAW represents the joining group.
Is this correct, or should it just read WAW?

Another question: when can I expect a new edition of the Unicode book?

Regards,
Theo Veenker




Writing/finding a UTF8, UTF16, UTF32 converter

2001-12-02 Thread Theo

Hi Unicode list,

I am dealing with Unicode for XML. I'm sorry if this bothers a few
people, but reading the technical information is not very easy. The
crossings-out and underlinings don't help, the information seems a bit
scattered, and the really interesting information is not linked from
easy-to-find places.

I think I have finally found what I wanted, the table:

Table 3.1. UTF-8 Bit Distribution

on http://www.unicode.org/unicode/reports/tr27/

Basically, I want to write some code that can convert UTF8, UTF16, and
UTF32 to either of the other two formats. I suppose I could use UTF32 as a
go-between to reduce the number of conversion paths.
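
For what it's worth, the UTF32 to UTF8 leg of such a converter comes more or
less straight from that bit-distribution table; a sketch (no handling of
invalid input beyond rejecting surrogates and out-of-range values):

#include <stddef.h>

/* Encode one UTF-32 code point as UTF-8 into buf (room for 4 bytes).
 * Returns the number of bytes written, or 0 for invalid input
 * (surrogates D800..DFFF and values above 10FFFF). */
static size_t utf32_to_utf8(unsigned cp, unsigned char buf[4])
{
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    if (cp < 0x80) {                     /* 7 bits: 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                    /* 11 bits: 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                  /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                /* 21 bits: 11110xxx + three 10xxxxxx */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}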

Anyhow, does anyone know of any existing source code that does this
transformation?

I don't feel like using Apple's Unicode converter, because it seems so
complex it will probably take MORE work for me to access it than to just
write the conversion code myself. And even then I hear it doesn't do
UTF32, so it's no use. And besides, I have to compile my code for
Win32 also, so it's even less use.

If anyone knows of some existing code that does the transformation,
that would help. I might end up re-writing it myself and just use the
code as a working example.

All that bit-shifting and bit-masking should slow down my UTF8/UTF16
processing; is there any accepted good way to speed this up? Some form
of table perhaps?

--
This email was probably cleaned with Email Cleaner, by:
Theodore H. Smith - Macintosh Consultant / Contractor.
My website: www.elfdata.com/





Unicode string functions

2001-08-06 Thread Theo

Hi people,

this is my first post to this list.

I want to write many string functions for dealing with Unicode. I am
writing both an XML engine and an XML editor (which I have
imaginatively called XML Engine and XML Editor). Now, of course, XML
1.0 specifies that I need to deal with Unicode, which my engine doesn't
currently do.

I've written a lot of string functions before; most of my programs deal
with data in fact, especially string data. I find dealing with strings
quite fun for some reason, and I've written a plugin for my language
(REALbasic) that speeds up many things to do with strings and adds
loads of new string features to REALbasic.

Anyhow, I find this Unicode system rather confusing at first. I hope it
will be easy to deal with, but I am not sure.

Please correct me on the basic facts of Unicode here. I am just
repeating them in order to see if anyone tells me I am wrong :o). Ok:

From what I can tell, Unicode is simply a numbering system, and an
encoding system for these numbers (a very crude summary, but it is
enough for me, for now). Unicode allows for compression in UTF8 and
UTF16, so that you can fit lower ASCII into 1 byte, while other numbers
that aren't lower ASCII may need more bytes. UTF32 represents the
numbers as is, no compression needed.

Now, I have heard that dealing with UTF8/UTF16 is a real nuisance,
because, say, to extract characters 6 through 8 of a string, you need to
start at the beginning and loop the entire way through, detecting
which bytes belong to multi-byte sequences.

I'd rather avoid dealing with UTF compression schemes, for RAM's
sake (more code), speed's sake (more to do), and bugs' sake (complex code
is buggier code).

OK, so I have an easy solution. My language, REALbasic, has a
TextConverter object that lets you convert between a huge range of text
encoding schemes! This feature is provided by Apple and Microsoft, and
REALbasic simply gives you access to it. So I can convert from
UTF8/UTF16 to UTF32, and use some string functions I will write that
specifically deal with UTF32.

OK, so in short:

If I am writing some string functions for UTF32 (with XML in mind), are
there stumbling blocks I may come across? I know almost nothing about
Unicode; can I just treat each char as a 4-byte character and not care
about any other Unicode special features? If there are special features
(zero-width characters, etc.), what do I have to do about them?

The string functions I'll need are:

*Character set functions (searches through a string, based upon whether each
character of this string is inside a character set, and returns the
position of the first character found that is, or isn't, in the
character set.)

*String searching functions (searches through one string, for another)

*Character getting functions (Get one character at a time)

*string extraction functions (get a segment of one string, and create a
new string that is a copy of it)

*string creation functions (create a string, that is just a number of
characters long, each of the same character value)

And then of course, I'll have to reconvert back to UTF8/UTF16 on
saving! Quite a bit of work, but nothing compared to what I have done
so far.

Is it possible I forgo all characters that can't be expressed in UTF16
without taking more than 2 bytes? That way, I can halve the RAM needed
for large XML documents. I'd rather not code two string-function
versions; one is enough. So it is 2 bytes, or 4 bytes. I read on
unicode.org that all characters of almost all languages can be
contained in 2 bytes.
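
For context on the 2-byte question: characters beyond the first 64K aren't
simply unrepresentable in UTF16, they are written as a pair of 16-bit
surrogate code units, roughly as this sketch shows, so a 2-byte-only design
is really a decision not to handle those pairs.

#include <stddef.h>

/* Encode one code point as UTF-16. A BMP character needs one 16-bit unit;
 * anything from 0x10000 up is split into a high/low surrogate pair.
 * Returns the number of 16-bit units written (1 or 2). */
static size_t encode_utf16(unsigned cp, unsigned short out[2])
{
    if (cp < 0x10000) {
        out[0] = (unsigned short)cp;
        return 1;
    }
    cp -= 0x10000;                                     /* 20 bits remain */
    out[0] = (unsigned short)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (unsigned short)(0xDC00 | (cp & 0x3FF));  /* low surrogate  */
    return 2;
}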

What languages need 4 bytes or more to describe? If they are some
unheard of tribe with 100 living members, perhaps I can just forgo
them?

--
This email was probably cleaned with Email Cleaner, by:
Theodore H. Smith - Macintosh Consultant / Contractor.
My website: www.elfdata.com/