Tying & Overloading

branden-perl6 Fri, 20 Apr 2001 09:38:58 -0700


Hello,

First of all, I'm sorry, this text is rather too long.

I worked a bit too long on this, mainly because I wanted to give a big
picture of it, without omitting important details. The big picture,
because I wanted to show why would we bother having different ways of
handling for tying and overloading. The details, because I would like to
show that it's (probably) feasible and (maybe) useful.

I hope you like it.




Internal Design of Tie & Overload
---------------------------------


Fact: The way Perl 5 handles tying and overloading sucks. It's slow,
it's not flexible, it's full of special cases. Tying is a hard thing to
do in C, using XS, and overloading isn't even documented.

Thinking about that, I begun to think on a way to do it more flexible
and efficient. I'll talk about tying in Perl 5, because I really don't
know how overloading is done, but I'll propose a way to define
overloading too.

What's so bad about tying in Perl 5? I guess it's the fact that it's a
special case, instead of the general case. Normal variables are set by a
call to sv_set*. For tied variables, a call to SvSETMAGIC has to be done
to assure the functions of the underlying tied object are called. The
tied objects are stored in a table, and SvSETMAGIC has to traverse the
table searching for entries to define which function should be called.

A simpler mechanism can be thought, if we consider that a non-tied
variable is just a special case of a tied one, a case where the STORE,
FETCH and the other functions just write or read the value of the data
field for the variable.

This is like inheritance in a OO language, like C++. For example, if I
were to implement something (a variable) that can have different
behaviours, depending on the type of the object (tied, not tied, there
are others), and I were to implement it in C++, I would create an
abstract class, with pure virtual methods, defining the interface for
the object (the interface for the variable). Then, I would define some
subclasses for this abstract class (the implementations) that would
define the behaviour on different cases.

Basically, for the specific case of a scalar variable, I would define an
abstract class with the methods STORE, FETCH and DESTROY, as pure
virtual methods. This is actually all a scalar variable can do: store a
value, give it to me, and destroy the value when it's disposed. Then I
would create one subclass, for non-tied variables, that would have a
member attribute variable to hold the value of the variable. Then I
would write the STORE and FETCH methods to write and read the member
attribute variable. The DESTROY would probably do nothing, considering
GC is being used.

If I needed, for example, a tied scalar that read the contents of some
file, I would then create a subclass of my base abstract class, and
implement the STORE and FETCH methods to write and read the whole file
and convert the data to a format Perl understands. DESTROY would
possibly close the file, or something like that.

Then, a routine that would use a scalar variable could have it cast to
the base abstract class, and call the STORE and FETCH methods, knowing
that it was implemented by a derived class to do the right thing,
independent of which base class it is.

This is not something unique to C++, or to OO languages. It can also be
done in C, using virtual dispatch tables (vtables). Virtual dispatch
tables are structs that have as fields pointers to functions. The
implementations fill one of this struct with its version of the methods.
The routine that uses the variable calls the methods by dereferencing
the pointers in the base class.

Is it any better than what it's done now? Well, I think yes, since
instead of having to always call a SvSETMAGIC function that traverses a
somewhat expensive table to find which method to call, even *after*
having run the code for sv_set*, we would be sure that the sv_set* code
would do the right thing directly, at the cost of dereferencing a
pointer and calling a function pointed by that pointer, which is
probably very little.

Now, the main idea is to convert  SV, AV, HV, and the other similar
internal structures of Perl to something that supports virtual dispatch
tables (what will be called a PMC, or Perl Magic Cookie). The essential
of the PMC is that it has a `data' field, which is a pointer, that
points to the internal structure of the variable (or object), and a
`vtable' field that points to a table that implements the methods
required by that variable (or object). One vtable can be shared among
many variables, for example, all non-tied scalar variables point to the
same vtable as do all scalar variables that read and write data to the
file, as in the example above. The data field, OTOH, should be private
to the variable, so that the vtable methods can act on one variable and
not on all of the same kind.


---


When thinking about the interface needed by the similar of SV and AV,
something got me thinking. It all was working well, but I saw problems
when I tried to see what would happen with the $a = $b instruction.

In Perl 5, SV is used for both the variable and the value. For example,
the stack is made by SVs. An AV is an array of SVs. The SV actually
keeps the string, int and other values of the variable. If I'd want to
keep this, and I would want to implement both tying and overloading with
virtual dispatch tables, I would need a vtable for SVs that had entries
for STORE, FETCH and also for PLUS, MINUS, etc. Now, consider $b is a
non-tied variable, i.e., it's STORE, FETCH, etc. are the ones for a
variable that just stores the value, but the value of $b is an
overloaded object, i.e., its PLUS, MINUS, etc. have a special
definition. Consider $a to be a tied variable, i.e., one that have STORE
and FETCH specially defined, but its value is not overloaded, i.e.,
PLUS, MINUS has no special meaning. What would happen when

    $a = $b;

??? If I copy the vtable of $b to $a, $a will not be tied anymore. If I
don't, the value copied won't carry its overloaded behaviour.

The thing is: a variable is tied, a value is overloaded. They're
different things. They should be separated. A variable's vtable should
have STORE and FETCH methods. A value's vtable should have PLUS, MINUS,
etc., and all related to overloading. The equivalent of an AV should
store an array of values rather than variables. The stack should be
implemented with values rather than variables. Temporaries should also
be values. The thing that makes them seem like the same is that a value
is quite the same as a non-tied scalar variable, but the difference is
that the scalar variable can be tied, when the value cannot.

That's why I found the conclusion that variables and values are two
separate things and shall be treated as such.


---


I tried to identify the objects that exist in Perl 5, to try to
find the methods they implement. Well, three obvious things that are
objects of Perl 5 are the variables: scalars, arrays and hashes. The
methods they should implement are also easy ones, because they're
pratically what are defined when tying a variable of each type. In
Perl's guts, they're represented by SV, AV and HV. Values are another
object, already discussed, also corresponds to SV.

Another obvious example of object is a filehandle, that also can be
tied, so it's also easy to find which methods it should have by looking
at how it's done by a class that ties a handle.

Other object of Perl is the package. A package can be thought of a GV in
Perl's guts. It's responsible by keeping the namespace, giving a piece
of code or a variable (or even a glob) for each name. A piece of code is
another object. It should have methods for receiving parameters, and
being called in a context (scalar, list, void), and evaluating its
operations and returning some values. An unnamed sub or a reference to a
sub would be a code object, for example. A regexp is also an object. The
qr// operator of Perl 5 allows to store a regexp in a variable, so it
should be an object.

Having (for example) packages, code, and regexps as objects with a
virtual dispatch table have many advantages, in the sense that there
could be implementations that do what these should do a little
differently. For example, packages, that are responsable to translate a
name to a method or variable, could have a table that would do the
lookup differently of what it's done using the @ISA variable, leading to
different kinds of inheritance. Code, could have a table that implement
different ways of calling the sub, for example running pre- or post-
handlers before and after the code. A virtual table for a code could
also `proxy' the call to a method in another language, or another
machine, translating arguments and return values. A virtual table for a
regexp could extend the syntax or the semantics of Perl's regexp engine.


---


I tried to describe what I think the classes above should look like. Of
course I didn't go all the way of doing all of them, since I probably
have many mistakes here, and probably many can help doing it better and
right. After all, that's why Perl 6 is being discussed in this list,
right?


1. The value

I will call ``value'' what is usually called an object. A constant is a
value, a variable holds a value, an array holds various independent
values, etc.

A value would correspond to Perl5's SV, with the difference that in
Perl5, SV is also used as a scalar variable. This has some
disadvantages, that I'll point later.

>From an OO view, a value should know how to operate with other values,
to create new values. For example, a number value should know how to add
itself to another number value, generating yet another number value. So
as a string value should know how to concatenate itself to another
string. Etc.

This leads to a hook on overloading. An overloaded object is simply a
value that has some ``custom'' functions to indicate how the operations
it can do are supposed to behave. Having an overloaded object would mean
having a value with a customized vtable that would indicate how the
operations to add or concatenate would be done.

In Perl, a value can have more than one kind of value. For example, a
value can be a numeral, a string, a reference, a blessed object, etc. It
even can be more than one of these at a time. (For example, $! can have
one value as a number and another as a string. c.l.p.m has examples on
how a variable can be a reference to an array and a reference to a hash
at the same time. See also perlguts for SvIV_ok). This should be
generalized, and any value should have to ``return'' its number value,
its string value, its array-reference value, and so on.


1.1. Value assignment and destruction

Well, I thought, I thought more, I thought again, I changed my mind
several times, and I see I can't get to a conclusion. I see three ways
of implementing values. One is to do have all values immutable, i.e.,
when you operate two values, the result is stored in a new allocated
value. Other is to do it as a general case, but have a specific case for
`simple values', that should be handled efficiently, like strings, small
numbers, etc, and should be modified in-place. The other is to allow
each vtable decide if the value will be immutable or if it will be
modified in-place, which I don't know if would work. I'll explain the
general idea and then talk about this.



Basically, every value is immutable (except for the exceptions, of
course). If you want to perform an operation with the value, the
operation should return another new value. It's easy to use values this
way, because you can share values between variables. For example, you
can have two variables sharing the same value, and you don't need to
worry when you increment one of them, you know the other will continue
with its old (immutable) value. (Actually, this can be quite consuming
if you are, e.g., incrementing a counter on a loop.)

Destruction is an important topic. A value should have a hook to allow
custom finalization for certain objects. This is actually essential to
allowing foreign objects inside of Perl.


1.2. The vtable

I'll now present the vtable (with the methods) for a value. It actually
depends on various definitions of variables, kinds of values, etc., but
it actually can be understood without these definitions.

   struct sval_vtable {
      int     TYPE;
      // find out which representations are available
      int   (*REPRS)      (SVAL *this);
      // get different views of the value
      void  (*NUMBER)     (NUM *result, SVAL *this);
      void  (*STRING)     (STR *result, SVAL *this);
      int   (*BOOLEAN)    (SVAL *this);
      // if the value is a reference
      void  (*SCALARREF)  (SVAR *result, SVAL *this);
      void  (*ARRAYREF)   (AVAR *result, SVAL *this);
      void  (*HASHREF)    (HVAR *result, SVAL *this);
      void  (*FHANDLEREF) (FHANDLE *result, SVAL *this);
      void  (*CODEREF)    (CODE *result, SVAL *this);
      // object stuff
      void  (*OBJECT)     (PACKAGE *result, SVAL *this);
      // numeric operations
      void  (*PLUS)       (SVAL *result, SVAL *this, SVAL *value);
      void  (*MINUS)      (SVAL *result, SVAL *this, SVAL *value);
      void  (*TIMES)      (SVAL *result, SVAL *this, SVAL *value);
      void  (*DIVIDE)     (SVAL *result, SVAL *this, SVAL *value);
      void  (*MODULUS)    (SVAL *result, SVAL *this, SVAL *value);
      // numeric ++ / --
      void  (*INC)        (SVAL *result, SVAL *this);
      void  (*DEC)        (SVAL *result, SVAL *this);
      // string operations
      void  (*CONCAT)     (SVAL *result, SVAL *this, SVAL *value);
      void  (*REPEAT)     (SVAL *result, SVAL *this, SVAL *value);
      // bit operations
      void  (*BITAND)     (SVAL *result, SVAL *this, SVAL *value);
      void  (*BITOR)      (SVAL *result, SVAL *this, SVAL *value);
      void  (*BITXOR)     (SVAL *result, SVAL *this, SVAL *value);
      void  (*BITNOT)     (SVAL *result, SVAL *this);
      void  (*BITSHL)     (SVAL *result, SVAL *this, SVAL *value);
      void  (*BITSHR)     (SVAL *result, SVAL *this, SVAL *value);
      // numeric comparisons
      int   (*NUMCMP)     (SVAL *this, SVAL *value);
      int   (*NUMEQ)      (SVAL *this, SVAL *value);
      int   (*NUMNE)      (SVAL *this, SVAL *value);
      int   (*NUMLT)      (SVAL *this, SVAL *value);
      int   (*NUMGT)      (SVAL *this, SVAL *value);
      int   (*NUMLE)      (SVAL *this, SVAL *value);
      int   (*NUMGE)      (SVAL *this, SVAL *value);
      // string comparisons
      int   (*STRCMP)     (SVAL *this, SVAL *value);
      int   (*STREQ)      (SVAL *this, SVAL *value);
      int   (*STRNE)      (SVAL *this, SVAL *value);
      int   (*STRLT)      (SVAL *this, SVAL *value);
      int   (*STRGT)      (SVAL *this, SVAL *value);
      int   (*STRLE)      (SVAL *this, SVAL *value);
      int   (*STRGE)      (SVAL *this, SVAL *value);
      // destructor
      void  (*DESTROY)    (SVAL *this);
   };

SVAL is a PMC (Perl magic cookie) has a sval_vtable as its vtable. The
prototype for SVAL is:

   struct SVAL {
      sval_vtable   *vtable;
      void          *data;
      int            flags;
      void          *thr_data;
      void          *gc_data;
   };

Other PMC's are similar (equal) to this, except for the type of the
vtable pointer. SVAR holds for a scalar variable, AVAR for an array
variable, HVAR for a hash, FHANDLE for a file handle, CODE for a piece
of code (a reference to a sub, or an inlined sub), and PACKAGE refers to
an object class, that can be used to call methods over an object. NUM
gives access to the number representation of the value, that can be an
integer, a float, or a big integer or big float. STR gives access to the
string representation of the value, that can be in an arbitrary
encoding.

I guess most of the vtable entries are self-explanatory, on what they
do. The major exceptions are INC and DEC, that return a new value (that
would be the incremented/decremented value of the current object). This
is necessary to support immutable values, that cannot be incremented
without creating a new value.

REPRS returns a bitmap indicating which of the representations (string,
number, scalar reference, array reference, object, ...) are available.

The TYPE field should be an unique integer that represents the vtable.
It should encode the vtable type (sval_vtable, in this case), so that
you can query what kind of object is a PMC. The TYPE could be used to
serialize a value. The unique identifier of the vtable would be used, so
that when de-serializing, the vtable whose de-serializing method should
be used can be found.


1.3. Variations on a theme

Well, the model above would be the one where every value cannot be
changed once allocated. This can be bad, for example, if the .= operator
is being used on a variable inside a loop, or something like that. To
cope with these cases, I thought of having a `special case'.

The special case would be storing a known struct in the data field of
the PMC for values that have a simple structure. With values that have a
simple structure, I would mean numbers, both integer and floats, but not
bignumbers, and strings, encoded in one of a maximum of three basic
encodings (probably one of 1-byte characters, other with 2-byte
characters and other with 4-byte characters). This kind of value would
be tagged with a NULL vtable field, or better, an actually working
vtable but with a special TYPE field, like 0. There would be macros to
access all values' operations. For example, there would be a sval_plus
macro that would be defined as

    if (this->vtable->TYPE) {
        this->vtable->PLUS(result, this, value);
    } else {
        result = this->data->int_value + value->vtable->INTEGER(...)
        ???
        handle the special case.
    }

There would also be a sval_plus_assign, that would be used on the +=
case, that would be done as

    if (this->vtable->TYPE) {
        this->vtable->PLUS(this, this, value);
    } else {
        this->data->int_value += value->vtable->INTEGER(...)
        ???
        handle the special case.
    }

sval_assign would be defined as

    if (this->vtable->TYPE) {
        // value is immutable
        result->vtable = this->vtable;
        result->data   = this->data;
    } else {
        // deep copy the data
        result->vtable = this->vtable;
        result->data   = malloc(...); // or get from a pool
        result->data->int_value = this->data->int_value;
        result->data->str_value = strdup(this->data->str_value);
        ...
        handle the special case.
    }

All access to a value would have to be done through the macros.

The other way I see for it is to allow each vtable decide if it would
share immutable values between variables or copy the values on
assignment. For this, it would be necessary to add an ASSIGN operation
to the vtable, like

      void  (*ASSIGN)     (SVAL *result, SVAL *this);

that would handle the same case of the macro sval_assign above, i.e.,
would do a deep copy on the data if the value is supposed to be changed
with an operation.

Besides, operations like PLUS, MINUS, etc., would have to deal with the
case result == this, and treat it specially, reusing the pre-allocated
area for the value, instead of allocating a new one.

Both these cases, where reusing an allocated memory area for a value is
allowed, require that *every* implementation of PLUS, MINUS, ..., even
the ones that don't reuse memory, can deal with result == this, i.e.,
they don't write something into result->data and try to read this->data
later, because they may be the same. I don't really know if this is
feasible, but I think it really is.

I really don't know which of these three ways is better. Probably, only
implementing the three and making some benchmarks over some real Perl
(or Perl-like) code to test real world situations would give a better
hint on what is better. The idea on supporting macros (like sval_plus,
sval_assign) would be a good idea since then the underlying mechanism
could be changed, to a certain extent, without having to change the
interface so much.

Actually, I think having macros for other objects (for example,
svar_store, svar_fetch, avar_store, hvar_delete, fhandle_readline, ...)
is a good idea, since at least it cuts down the writing of
x->vtable->METHOD(w, x, y) to type_method(w, x, y), and x doesn't have
to be typed twice, and neither vtable.


2. The variables

There are (basically) three types of variables in Perl: scalars, arrays
and hashes. (Ok, there are also filehandles -- that will become objects,
and globs -- that probably will go away. Are there any other?).
Filehandles will also be described below.

The purpose of having vtables for variables are to provide a mechanism
for tying. The layout for the vtables is almost all defined by the
current (Perl5) tying mechanism.

The proposal for them is:

   struct svar_vtable {
      int     TYPE;
      void  (*FETCH)      (SVAL *result, SVAR *this);
      void  (*STORE)      (SVAR *this, SVAL *value);
      void  (*DESTROY)    (SVAR *this);
   };

   struct avar_vtable {
      int     TYPE;
      void  (*FETCH)      (SVAL *result, AVAR *this, int index);
      void  (*STORE)      (AVAR *this, int index, SVAL *value);
      int   (*FETCHSIZE)  (AVAR *this);
      void  (*STORESIZE)  (AVAR *this, int size);
      void  (*EXTEND)     (AVAR *this, int size);
      int   (*EXISTS)     (AVAR *this, int index);
      void  (*DELETE)     (AVAR *this, int index);
      void  (*SPLICE)     (AVAR *result, AVAR *this, int index,
                           int length, SVAL **list_of_values);
      void  (*DESTROY)    (AVAR *this);
   };

   struct hvar_vtable {
      int     TYPE;
      void  (*FETCH)      (SVAL *result, HVAR *this, SVAL *key);
      void  (*STORE)      (HVAR *this, SVAL *key, SVAL *value);
      int   (*EXISTS)     (HVAR *this, SVAL *key);
      void  (*DELETE)     (HVAR *this, SVAL *key);
      void  (*CLEAR)      (HVAR *this);
      void  (*FIRSTKEY)   (SVAL *result, HVAR *this);
      void  (*NEXTKEY)    (SVAL *result, HVAR *this, SVAL *lastkey);
      void  (*DESTROY)    (HVAR *this);
   };

   struct fhandle_vtable {
      int     TYPE;
      void  (*PRINT)      (FHANDLE *this, SVAL **list_of_values);
      void  (*PRINTF)     (FHANDLE *this, SVAL *format,
                           SVAL **list_of_values);
      void  (*READLINE)   (SVAL *result, FHANDLE *this);
      int   (*GETC)       (FHANDLE *this);
      int   (*WRITE)      (FHANDLE *this, SVAL *buffer,
                           int length, int offset);
      int   (*READ)       (FHANDLE *this, SVAL *buffer,
                           int length, int offset);
      void  (*CLOSE)      (FHANDLE *this);
      void  (*DESTROY)    (FHANDLE *this);
   };

I actually don't think any of these needs an explanation.


3. Code references

The vtable for code references would actually have a method for
evaluating the code, passing some parameters (a list of values) and
returning some of them. It's necessary to add a mechanism for telling
the code in which context (scalar, list, void) it's being called.

Here would be an appropriate place to put methods to inspect what a code
do. For example, there could be an interface to access the actual
bytecode produced by a method. Afterall, that's what the interpreter
needs to run a piece of code.


4. Objects

Objects (blessed thingies) in Perl are nothing more than a special
syntax to allow the use of -> to call methods on values. That's why an
object value should return a PACKAGE, whose vtable would do the method
dispatching logic.

The package vtable would have methods to find a method (that would have
the code vtable) by its name. This vtable would deal with @ISA (or any
other needed mechanism), so that it should be possible to have different
mechanisms of inheritance based on the package vtable.

The package vtable could also generate customized code vtables (by the
use of wrappers) for the methods, so that some special actions are
executed with the methods. For example, this could be used to add pre-
and post- handlers for methods of a class, by creating a subclass of the
package vtable for objects of that class.

The properties of the object (that are usually stored in a hash
reference) aren't actually needed here, since a value can be an object
and a hash reference at the same time. It could actually be an array
reference and a scalar reference as well, so as a filehandle, so that a
value can effectively be used to represent a (reference to a) typeglob.

What's missing? Strings and numbers. And regexp's, that are actually
related to strings (or to values???).


5. Numbers

The vtable for numbers should be so that it should be possible to take
the value of the number as an integer, a float or a bignumber. The
integer and float parts can be dealt easily by methods that return the
value as a simple C int or C float. The problem is dealing with bigints
and bigfloats.

The easy way of dealing with them is to have only one representation of
bigints/bigfloats and have some functions that deal with them. The
flexible way would be have only the interface of functions to access
pieces of the bignumber, and to operate bignumbers, and have the
different implementations implement this interface. I really don't see
the need for having different implementations of bignumbers, only if
integration with other language's implementations of bignumbers is
desired, in which case it would be possible to build a wrapper around
the other language's bignumber, instead of having to convert it to the
internal format.


6. Strings

Well, strings are other issue. Strings are the main data that is used in
Perl. Perl is a language for manipulating strings. Well, mainly strings.

The string object wasn't of much necessity until Perl 5, and it has
always been represented as `char *' i.e., a 0 terminated array of 1-byte
characters, all of them in the same encoding (ASCII or Latin-1). This
representation isn't perfect, since it isn't flexible, but it was good
enough so far. The problem comes with Unicode, because now strings can
have various encodings, and various widths for characters, even
characters with different widths in the same string. It's desirable that
Perl handles Unicode, since it's the base for many new technologies,
like XML, for example.

I know there are many proposals to fix the internal encoding of strings
so that independently of the external encoding used, the internal
encoding should be always the same (or one of three or four). I'm
radically against this, since I think flexibility is much more important
than that. There are arguments against using variable-width encodings
internally, but I think they're weakened in consideration of the use of
a char-offset vs. byte-offset scheme.

Strings should be efficient, yet flexible. While copy-on-write is
desired in some cases, sharing through immutable strings is desired in
others. We'll have to deal with multiple encodings externally, I don't
see why shouldn't we deal with multiple encodings internally as well.

Memory should also be considered. With Unicode in mind, there is the
fixed width vs. variable width characters problem. If we use fixed width
only, we may waste too much memory for big strings that basically don't
have extended characters, but using variable width is a problem when
extracting substrings.

In my opinion, flexibility is the way to go, because then the user can
have different strings implementations and choose between them (or
switch between them) as she pleases (and wants her code optimized).

For dealing with indexes in the string, I propose the use of two
indexes. One is the index of the character in the substring, cindex. It
indicates the position of the character, for example the eleventh
character would have cindex 10 (if we start at 0). The other index is
the byte index, bindex. It takes into account the byte position inside
the string. For example, in a variable character width string, where a
character can be 1 to 3 bytes, the bindex of the character with cindex
10 could be (almost) any value between 10 and 30. If all preceding
characters take up 1 byte, it would be 10, if 3 of them take up 2 bytes
and 1 takes up 3 bytes, it would be bindex 15.

The user would have a way of transforming cindex to bindex, and then
should use bindex in the functions that expect a pointer to a character
in the string. Actually, the value of bindex if opaque to the user, it's
only guaranteed to be an efficient pointer to a character inside the
string. If a vtable implements a string as a balanced tree, or some
other crazy stuff, the only prerequisite for bindex is that it uniquely
identifies a character inside the string efficiently (without having to
traverse the string to find the position).

For traversing a string, there would be an interface. You pass to the
string the bindex of the character you're interested, the number of
characters you need, and the width (or encoding) of the characters
you're expecting, besides of an optional function telling the string how
to ``downgrade'' the characters, e.g., how to convert a character with 4
significant bytes to a 1-byte character (help me you Unicode experts!).
You also pass this function a buffer (with enough room, according to the
substring size and the character width requested). The function would
then fill the buffer with the requested substring, in the requested
encoding, and would also return the bindex to the next character after
the substring. This function could be used to traverse a string
character by character or in blocks of n characters, using any byte
width. For example, it would be possible to iterate over a string
encoded in UTF-8, by having a 4-byte int holding a character at each
time very easily. The size of the string would also be queried as a
bindex so that it could be used to control the end of a loop iterating a
string (that uses a bindex as a pointer in the string).

Regexp matching should probably use bindex'es on immutable strings to
indicate matches, so that it's not needed to copy $1, $2, ... and $`,
$&, $' variables until they're used, what would solve one of the big
efficiency problems of Perl5 (about using $&).

Regexp's are actually pretty tied to strings. Actually, I think there
should be methods for matching and substituting in the string vtables,
what takes me to...


7. Regexp's.

Regexp's should probably have a vtable of their own. It's actually
possible to have a reference to a compiled regexp, by using qr//, so
that a value should also can be dereferenced as a regexp (pending...).
The operations =~ and !~ should probably also be methods in a vtable
(value or string?) so that they're overloadable.

The vtable of a regexp should probably give access to the regexp's guts,
so that it's possible to traverse the regexp as a tree through its
vtable.

What's the use of that? Well, if the regexp engine accesses the regexp
by its vtable, it would be possible to create other regexp syntaxes (or
extensions) and generate the tree of matching operations when the vtable
is queried.

Also, it would be possible to a module to take different actions
depending of the format of the regexp. For example, if a regexp is to be
used to seek words in a dictionary file, a module could inspect the
regexp to see if all the matching words have a common prefix. Then it
could use an index to seek the first word that begins with this prefix,
and match the regexp only to the words that begin with this prefix,
reducing the number of words to inspect in the dictionary. This is an
example of allowing efficient queries without loosing flexibility.


---


Some problems proposed in the RFC's that would be also solved by
exposing such an OO interface of Perl's guts would be:
- use of multiple string encodings internally,
- use of foreign objects (other language's) inside Perl,
- integration of bigints/bigfloats, and
- persistence/marshalling/serialization.


1. Use of multiple string encodings internally

This can be dealt easily by the strings vtable, that make all accesses
to a string through this interface, so that the internal encoding
doesn't matter for anything outside the implementation for this
interface. The access to the string (traversal) would be efficient
independent of its encoding.

The user should be able to ``ask'' a string to change its encoding (and
consequently its vtable) to use an encoding that would do operations
more efficiently on the string. For example, before using substr on
random positions of a string, the user could tell the string to change
to a fixed character width representation, so that the string doesn't
need to be traversed to find the bindex for each cindex.

2. Use of foreign objects (other language's) inside Perl

This could be done by having a sval_vtable implementation that
``proxies'' method calls of Perl to method calls in the other language.
The DESTROY method could be used to invoke the destructor of the other
language's object.

3. Integration of bigints/bigfloats

Simply by having sufficiently smart PLUS, MINUS, ... methods.

4. Persistence/marshalling/serialization

This could be done by adding SERIALIZE/DESERIALIZE (or
MARSHALL/UNMARSHALL) methods to the value vtable. On serializing, the
TYPE of the object's vtable would be written, then the output of
SERIALIZE. To de-serialize it, the written TYPE would be read. Then it
would be used to index a table of the available vtables. The DESERIALIZE
method of this vtable would be called, resulting a new value like the
original one.

Probably it would also be necessary to add SERIALIZE/DESERIALIZE methods
to variables, strings, numbers, regexps, code, fhandle, etc. so that
reference values would be dealt the right way.


---


I would like to try somethings for string vtables. I'm not quite sure
how they would be, since I'm not that familiar with Unicode, etc. So I'm
taking the following premises:
- a string is a sequence of characters,
- the length of the string is the number of characters in it,
- a character can take a variable number of bytes to be represented,
- there's an encoding, a `canonical' encoding, in which every character
  can be represented in exactly 4 bytes;
- there are other two `canonical' encodings, one with 2 and other with 1
  byte per character, that can be used to represent strings that don't
  use more than this number of bytes per character;
- the typical use of a string is by iterating its characters in order
  one by one, for example, by taking the first character of the string
  and processing it, then the second, the third, etc.

There's something I'm probably missing here that I read about in the
list that is the possibility of the same thing being represented with
different number of characters, for example an A with an acute could be
represented as one character or as two, one for the A and another for
the acute. I know too little about it to consider writing about it, so I
won't. I ask you Unicode experts to help me on this.

Well, the way I see to solve the problems of variable width encoding
*with efficiency* is to have two kinds of index in the string, the byte
index (or bindex) and the character index (or cindex). The cindex is the
actual offset of the character in the string, so that the 1000th
character of the string has cindex 999. The bindex is the internal
offset of the internal representation, and should be regarded as an
opaque value. For example, the bindex for the 1000th character of the
string above could be 999, if the string is an ASCII string, or it could
be anything from 999 to 3000, I think, if the string were encoded in
UTF-8. This would depend of how many bytes the characters before the
1000th character would take.

The two restrictions that are necessary for bindexes is that, in one
string, if one character comes after another (it's cindex is greater
than the other's one), the bindex of the former is also greater than the
latter's one. The other restriction is that if a character is the first
character of a string (and consequently has cindex 0), the bindex of
this character is also 0.


   typedef int bindex;
   typedef int cindex;
   typedef int char_t;     // 4-byte character

   struct string_vtable {
      int     (*CLENGTH)   (STRING *this);
      int     (*BLENGTH)   (STRING *this);
      int     (*MAX_WIDTH) (STRING *this);
      bindex  (*BINDEX)    (STRING *this, bindex start, cindex index);
      char_t  (*CHAR_AT)   (STRING *this, bindex index, bindex *next);
      void    (*SUBSTR_AT) (STRING *this, bindex start, int clength,
                            int char_width, char *buffer,
                            int *olength, bindex *next);
   };


Well, now the explanation. CLENGTH should return the length of the
string i.e., the number of characters the string contains. BLENGTH
should return the number of bytes the string contains (this isn't
completely true, it should actually return the bindex of the last
character of the string if the string were one character longer, this
only matters as the return value of next on the CHAR_AT and SUBSTR_AT
calls).

The MAX_WIDTH method returns the number of bytes that would be needed to
encode this string in one of the `canonical' encodings. This should be
1, 2 or 4 (by my premises, I don't know if this should be different).
The vtable implementation can always say 4 here and let the language use
the widest buffers possible, for a short string. However, this can
increase the memory needs of the application.

BINDEX translates a cindex to a bindex. It takes a bindex argument that
means from where it should start counting the characters, and a cindex,
that is the number of characters it should count. The bindex should be 0
to find the bindex of a cindex from the start of the string. For a fixed
width character string, this only means multiplying cindex for the width
of the characters, but for a variable width character string, this can
mean traverse the string from the starting bindex, character by
character, the number of characters indicated by cindex, until finding
the bindex of it.

CHAR_AT finds the character at a specific bindex. It returns the
character in the `canonical' 4-byte encoding, and fills the bindex
pointer passed as argument with the bindex of the next character.

BLOCK_AT does the same as CHAR_AT, it only gets several characters at a
time. The start argument indicates the starting position, the clength
indicates the number of characters, the char_width specifies the width
of the `canonical' encoding it should expect, buffer is the pointer to a
buffer that has at least clength * char_width bytes allocated to it,
olength is filled with the actual number of characters filled in the
buffer, and next is filled with the bindex of the character after the
last character read into the buffer.

How would one deal with a string (independently of its encoding) ?
Traversing it would be easy, by:


    STRING *the_string;
    bindex  the_length;
    bindex  i;
    char_t  c;

    the_length = the_string->vtable->BLENGTH(the_string);
    for(c  = the_string->vtable->CHAR_AT(the_string, 0, &i);
        i != the_length;
        c  = the_string->vtable->CHAR_AT(the_string, i, &i))
    {
        /* deal with c, which is the 4-byte character */
    }


Using SUBSTR_AT wouldn't be much harder, and it would be possible to
deal with the string more than one character at a time.

There's actually a quite big problem with using this in Perl. It's that
inside Perl, `substr' always gets cindexes as offset, so that the usual
Perl code that deals with `substr' is:


    $pos = 0;
    $length = length $string;
    while ($pos < $length) {
        $c = substr($string, $pos++, 1);
        # deal with $c
    }


This wouldn't be efficient, since it would have to call BINDEX to
convert the cindex to a bindex every time `substr' is called, and that's
not desirable. The solution I propose for this is to attach a `cache' to
a string with the last cindex -> bindex translation derived of the use
of CHAR_AT and SUBSTR_AT, so that after `substr($string, $pos, 1)', the
cindex `$pos + 1' would have its bindex cached. Anyway, this should
probably be handled by the interpreter, by putting some tags on the code
where the substr operation is being used.

Another problem I see is for when the code traverses the string from the
end backwards. But I think some encodings don't even support finding the
previous character from one character, so that traversing them backwards
wouldn't be possible. For these cases I would actually recommend
converting the string to a fixed character width encoding before
iterating it, either explicitly by a statement, or implicitly by the
compiler (by detecting this possibility in the code) or the runtime (by
seeing it's going backwards).

Anyway, I personally think that `substr' in a loop is a suboptimal way
to iterate strings in Perl. Regexp's should be used instead. And
regexp's go most of the time (if not always) forwards in the string.


---


Well, I think that's all I have to say on the subject. Now I'd like to
hear from you. Do this all make sense? Is it useful? Is it worth for
Perl 6? Is it too clumsy? Are there things I didn't mention here?


I'm expecting valuable feedback on this.


- Branden



__________________________________________________________
Get your FREE personalized e-mail at http://www.canada.com
Tying & Overloading

Reply via email to