Re: Unicode step by step

Leopold Toetsch Sun, 11 Apr 2004 02:39:35 -0700

Jeff Clites <[EMAIL PROTECTED]> wrote:
> On Apr 10, 2004, at 6:13 AM, Leopold Toetsch wrote:


>> 2) String PBC layout. The internal string type has changed. This
>> currently breaks native_pbc tests (that have strings) as well as some
>> "parrot xx.pbc" tests related to strings.

> These are working for me (which tests are failing for you?)--

$ make testr
Failed Test        Stat Wstat Total Fail  Failed  List of Failed
-------------------------------------------------------------------------------
t/pmc/perlstring.t    3   768    33    3   9.09%  1-3
53 subtests skipped.
Failed 1/89 test scripts, 98.88% okay. 3/1432 subtests failed, 99.79% okay.

I haven't looked further yet.

> ... Of
> course, since the internals changed the pbc layout changed also, so the
> native_pbc test files need to be regenerated on the various
> platforms

No problem.

> But, it's correct that there's no backward-compatibility code in place,
> to allow reading old pbc files. Do we want to have that sort of thing
> at this stage?

No, not needed.

>> The layout seems to depend somehow on the supported Unicode levels (or
>> not). So before fixing the PBC issues, I'd just like a statement:
>> parrot_string_t looks such and such, or of course stays as it is now.

> Could you rephrase? I'm not understanding what you are saying.

Well, the question is: Is s->representation enough to describe our
strings?

> ... The only other wrinkle is that for cases where
> s->representation is 2 or 4, we need to correct for endianness when we use
> the bytecode.

Yep.
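
For illustration only, the endianness correction for 2- or 4-byte
representations could look roughly like the sketch below. The function
name swap_code_units and its signature are hypothetical, not Parrot's
actual API; it assumes "width" mirrors s->representation (1, 2 or 4)
and that the caller has already determined that the file's byte order
differs from the host's.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of Parrot): reverse the bytes of each
 * code unit in place. "width" is the size of one code unit in bytes and
 * is assumed to mirror s->representation (1, 2 or 4).
 * Works byte-wise, so the buffer needs no particular alignment.
 * Returns 0 on success, -1 for an unsupported width.
 */
int swap_code_units(unsigned char *buf, size_t n_units, int width)
{
    size_t i;

    if (width != 1 && width != 2 && width != 4)
        return -1;

    for (i = 0; i < n_units; i++) {
        unsigned char *unit = buf + i * (size_t)width;
        int lo = 0;
        int hi = width - 1;

        /* For width == 1 this loop body never runs. */
        while (lo < hi) {
            unsigned char tmp = unit[lo];
            unit[lo] = unit[hi];
            unit[hi] = tmp;
            lo++;
            hi--;
        }
    }
    return 0;
}
```

A loader would call this once per string constant, and only when the
byte order recorded in the pbc header is not the native one.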

> This is probably a separate discussion, but we _could_ decide instead
> to represent strings in pbc files always in UTF-8. Advantage: Simpler,
> no endianness correction needed, probably durable to further changes in
> string internals, could isolate s->representation awareness to string.c
> and string_primitives.c. Disadvantages: De-serializing a string from a
> pbc file will always involve a copy, and could result in larger files
> in some cases. I could argue it either way--one's cleaner, the other is
> probably faster.

Strings from PBC constants can't be used directly anyway. We munmap() or
free() the image after loading, so string constants are always copied. I
think using UTF-8 would be best.
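
To make the copy step concrete: with UTF-8 in the file, loading a
constant is a decode into a fresh buffer, and no byte-order handling is
needed because UTF-8 is a byte sequence. A minimal sketch follows,
assuming the target is a 4-byte representation. The name utf8_to_utf32
is hypothetical and not an existing Parrot or ICU function; a real
loader would pick the narrowest representation that fits.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of Parrot or ICU): decode n bytes of
 * UTF-8 at src into dst as UTF-32 code points. dst must have room for
 * n elements (worst case: all-ASCII input).
 * Returns the number of code points written, or -1 on malformed input
 * (bad lead/continuation byte, truncation, overlong form, surrogate,
 * or value above U+10FFFF).
 */
long utf8_to_utf32(const unsigned char *src, size_t n, uint32_t *dst)
{
    size_t i = 0;
    long count = 0;

    while (i < n) {
        unsigned char b = src[i];
        uint32_t cp;
        uint32_t min;   /* smallest code point allowed for this length */
        int extra;      /* number of continuation bytes expected */
        int k;

        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else return -1;              /* stray continuation or invalid lead */

        if (n - i <= (size_t)extra)
            return -1;               /* sequence runs past end of input */

        for (k = 1; k <= extra; k++) {
            unsigned char c = src[i + (size_t)k];
            if ((c & 0xC0) != 0x80)
                return -1;
            cp = (cp << 6) | (uint32_t)(c & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        dst[count++] = cp;
        i += (size_t)extra + 1;
    }
    return count;
}
```

Since the image is unmapped or freed after loading anyway, this decode
replaces the copy we already do rather than adding one.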

>> There is of course still the question: Should we really have ICU in
>> the tree? This needs tracking updates and patching (again) to make it
>> build and so on.

> One consideration is that I may need to patch ICU a few places--there's
> at least one API which they only expose in C++, so I need to wrap it in
> C and it's cleaner to do that as a patch to ICU rather than having C++
> code in the core of parrot.

Can we get the ICU maintainers to integrate that interface?

> Jeff

leo
