Re: Unicode thoughts...

Jeff Mon, 25 Mar 2002 19:26:37 -0800

Hong Zhang wrote:
> 
> I think it will be relative easy to deal with different compiler
> and different operating system. However, ICU does contain some
> C++ code. It will make life much harder, since current Parrot
> only assume ANSI C (even a subset of it).
> 
> Hong
> 
> > This is rather concerning to me.  As I understand it, one of
> > the goals for
> > parrot was to be able to have a usable subset of it which is totally
> > platform-neutral (pure ANSI C).   If we start to depend too much on
> > another library which may not share that goal, we could have trouble
> > with the parrot build process (which was supposed to be
> > shipped as parrot bytecode)


I guess it's obvious that I hadn't looked at the target platforms for
ICU as closely as I probably should have. C vs. C++ doesn't concern me,
as it can always be rewritten, but lack of platforms like OS X does.
Given that, I think we should go with an interim solution consisting of
the basic Unicode utilities we'll need, such as Unicode_isdigit(). This can be a simple
wrapper around isdigit() for the moment, until I sort out which files we
need from the Unicode database, and what support functions/data
structures will be required.
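
As a rough illustration of the interim wrapper described above, here is a
minimal sketch. The Unicode_codepoint typedef and the code-point argument
are assumptions made for the example, not settled Parrot types. It only
recognizes ASCII digits, which is exactly the limitation the Unicode
database work would later remove.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical code point type; Parrot's real integer types may differ. */
typedef unsigned long Unicode_codepoint;

/*
 * Interim Unicode_isdigit(): a thin wrapper around isdigit().
 * Returns 1 for ASCII '0'..'9' and 0 for everything else, including
 * non-ASCII digits that a full Unicode database lookup would accept.
 */
int Unicode_isdigit(Unicode_codepoint cp)
{
    /* The caller is expected to pass a code point inside the Unicode range. */
    assert(cp <= 0x10FFFFUL);

    /*
     * isdigit() is only defined for values representable as unsigned char
     * (or EOF), so reject anything outside ASCII before delegating.
     */
    if (cp > 0x7FUL)
        return 0;

    return isdigit((int)cp) != 0;
}
```

With this version, Unicode_isdigit('7') is true, while a code point such as
U+0663 (ARABIC-INDIC DIGIT THREE) is reported as a non-digit until real
Unicode data is wired in.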

Given that we're committed to either UTF-16 or UTF-32 for the internal
string representation (as yet undecided, and the choice isn't affected by
this), we can get away with creating a simple unicode.{c,h} suite of
functions that looks like:

Parrot_Int Parrot_isDigit(char* glyph);

We can get away with the simplicity here because the character array
should already be a valid UTF-{16,32} string, and responsibility for
making sure there's a valid glyph at that offset can be safely offloaded
to the caller, if not higher up the calling chain. Also, it should be in
a separate file because, assuming the final internal representation
matches that of the RE engine, the engine can use these utilities as
well.
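
To make the caller contract concrete, here is one way the function could
look. This is a sketch under several assumptions that are not part of the
proposal: the internal form is taken to be UTF-32 stored big-endian,
Parrot_Int is stood in for by long, and utf32be_decode() is a hypothetical
helper. No validation is done beyond a NULL check, since the caller is
trusted to point at a complete glyph.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in for Parrot's integer type; the real definition may differ. */
typedef long Parrot_Int;

/* Hypothetical helper: read one big-endian UTF-32 code unit (4 bytes). */
static unsigned long utf32be_decode(const char *glyph)
{
    const unsigned char *b = (const unsigned char *)glyph;

    return ((unsigned long)b[0] << 24) |
           ((unsigned long)b[1] << 16) |
           ((unsigned long)b[2] << 8)  |
            (unsigned long)b[3];
}

/*
 * Returns 1 if the glyph at this offset is an ASCII digit, 0 otherwise.
 * The caller guarantees that glyph points at a complete, valid glyph.
 */
Parrot_Int Parrot_isDigit(char *glyph)
{
    unsigned long cp;

    assert(glyph != NULL);
    cp = utf32be_decode(glyph);

    /* Interim behaviour: anything outside ASCII is not a digit. */
    if (cp > 0x7FUL)
        return 0;

    return isdigit((int)cp) != 0;
}
```

Because the function never scans forward or backward, an RE engine sharing
the same internal representation could call it directly on its own
buffers.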

Now, admittedly this is only slightly better thought out than the
original proposal, but I think it has a much better chance of being
implemented, and in a fairly short amount of time. (He said, knowing
full well that there's always one more problem.) ASCII versions of the
functions should be almost trivial, and can be kept in place behind a
compile-time switch should we choose to do an ASCII-only or UTF-8-only
version.
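
The compile-time switch could be sketched roughly as follows. The macro
names PARROT_ASCII_ONLY and PARROT_GLYPH_WIDTH, the glyph_value() helper,
the long stand-in for Parrot_Int, and the big-endian UTF-32 default are all
assumptions made for illustration. Only the decoding step differs between
builds; the classification code is shared.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in for Parrot's integer type; the real definition may differ. */
typedef long Parrot_Int;

#ifdef PARROT_ASCII_ONLY

/* ASCII-only build: every glyph is exactly one byte. */
#define PARROT_GLYPH_WIDTH 1

static unsigned long glyph_value(const char *glyph)
{
    return (unsigned long)(unsigned char)glyph[0];
}

#else

/* Default build (assumed here): big-endian UTF-32, four bytes per glyph. */
#define PARROT_GLYPH_WIDTH 4

static unsigned long glyph_value(const char *glyph)
{
    const unsigned char *b = (const unsigned char *)glyph;

    return ((unsigned long)b[0] << 24) |
           ((unsigned long)b[1] << 16) |
           ((unsigned long)b[2] << 8)  |
            (unsigned long)b[3];
}

#endif

/* Shared classification: identical in both builds. */
Parrot_Int Parrot_isDigit(char *glyph)
{
    unsigned long cp;

    assert(glyph != NULL);
    cp = glyph_value(glyph);

    if (cp > 0x7FUL)
        return 0;

    return isdigit((int)cp) != 0;
}
```

Building with -DPARROT_ASCII_ONLY would select the one-byte path; without
it the wide path is used. Callers can step through a string in units of
PARROT_GLYPH_WIDTH in either build.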

In conclusion, this approach feels more workable, and the full UTF-16
implementation details can be rolled out incrementally, rather than in
a single mass migration. If this suggestion flies, I'll rewrite
strings.pdd and post it in the next few days.
--
Jeff <[EMAIL PROTECTED]>
