For the past week or so, I've been working on design and code in this area. While I wait for answers to my last couple of questions, I'm throwing this out there. This is mostly FYI, partly RFC, and not a FIX.
Portability:

There are four levels of potential portability and portability obstacles.

- The Parrot source code needs to be portable across all platforms that we wish to support. This is achieved through standards adherence, major configuration checks, and, as little as possible, parallel code branches. This level of portability is required, for obvious reasons.

- The Parrot bytecode needs to be parsable across multiple platforms. Strings need to be identifiable as strings, numbers as numbers, and opcodes as the opcodes they are meant to represent. This is the level of portability that I am addressing.

- The Parrot bytecode needs to execute consistently across multiple platforms. This means an 'add' opcode needs to add everywhere, and not, say, get the hostname. This is where portability breaks down, because the functionality that Parrot opcodes encode is itself not truly portable. Where possible, functionality should be emulated; where not, documented. Creating a symbolic link is an example.

- Parrot data needs to be consistent across multiple platforms. This is portability in user space, and there's little we can do about it. 'open file' is only portable provided 'file' actually exists on all systems.

Goals:

As stated above, I only wish to address the second layer - the ability of Parrot to parse bytecode, regardless of the platforms involved.

- Parrot bytecode should be most efficient when compiled for and run on the native platform. Perl's number one priority is always speed where it counts.

- Parrot bytecode should be efficiently compilable for, or runnable on, the majority of supported systems.

- Parrot bytecode should be compilable for, or runnable on, all platforms that Parrot supports.

- These goals should be addressed simultaneously.

Obstacles:

There are currently five obstacles to portable bytecode.

- Endianness. The three major types are big, little, and Vaxian. Supporting these three should handle the majority of cases.

- Supported sizes.
The current design assumes at least 32 bits. Most platforms have a 32-bit type, and some have 64-bit support. Some 64-bit machines support 32 bits, and others do not.

- Alignment. Along the same lines as above, 64-bit machines sometimes have 64-bit alignment requirements.

- Floating point representations. The four major types are IEEE(ish), Vaxian, Cray's CRI, and the IBM/370 hexadecimal format. There are some minor variations among these, particularly in how much of the IEEE-754 standard the floating point operations adhere to. However, adherence falls more into Portability Layer Three; here we address representation alone.

- File size limits. Some systems are limited to files of a certain size.

Solutions:

Each of these obstacles is solvable, some in more than one way.

- As long as the required size is known, it is trivial to convert endianness from one form to another. Although this is most efficient when the item to be transformed is properly word-aligned and fits within a native integer type, it is possible to transform data of arbitrary types.

- Requiring data representation to be either 32 or 64 bits maximizes bytecode portability, while maintaining native efficiency and minimizing the number of sizes to support. Limiting internal data itself (such as opcode numbers) to 32 bits ensures valid downgrading of 64-bit formatted data.

- Aligning structures of 64 bits or greater on 64-bit boundaries ensures the most efficient boundary conditions for all supported platforms.

- The majority of platforms can handle some form of 64-bit floating point number. All of the formats are easily interchangeable, although some IEEE semantics are not. (To ensure Level Three portability, those may need to be virtualized at that level.) Three of the four types share similar ranges - the Cray being the one exception. Data limitation (similar to the above) will ensure proper portability between native types.
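As a sketch of the first solution above - size-known endianness conversion - here is how a word-wise byte swap between big- and little-endian might look. Python is used purely for illustration; the function name and interface are invented here, this is not Parrot code, and the VAX mixed ordering would need an additional half-word swap on top of this.

```python
import struct

def swap_endianness(data: bytes, width: int) -> bytes:
    """Reverse the byte order of each fixed-width word in `data`.

    `width` is the word size in bytes (4 for 32-bit words, 8 for 64-bit).
    Assumes len(data) is a multiple of `width` - i.e., the size is known.
    """
    return b"".join(data[i:i + width][::-1]
                    for i in range(0, len(data), width))

# A 32-bit opcode stream written big-endian...
be = struct.pack(">3I", 1, 2, 0x0A0B0C0D)
# ...reads back correctly as little-endian after the word-wise swap.
le = swap_endianness(be, 4)
assert struct.unpack("<3I", le) == (1, 2, 0x0A0B0C0D)
```

Because the swap only needs the word width, the same routine covers integers, offsets, and (representation aside) floating point words alike, which is what makes "as long as the required size is known" the operative constraint.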
An alternate solution would be to encode floating point numbers as strings, and reconvert them on parsing.

- We can set an arbitrary limit on produced bytecode. Given that we are producing our own file formats, we should be capable of producing multi-part bytecode files on machines that don't support file sizes greater than some arbitrary size - nominally, 2 GB. This limitation would be independent of providing large file support for user data files.

Proposal:

For background, revisit my proposed Bytecode Format (v2) at http:[EMAIL PROTECTED]/msg05640.html. Although it is outdated, it gives a general gist of the direction of my thinking. In particular, pay no heed to the incremental, relative addressing of each section. By capping bytecode to an arbitrary size, we should be able to do direct indexing.

- All bytecode is by default written in native endianness. This maximizes efficiency for the native format (goal 1), and leaves reading by other platforms efficient (goal 2). Alternately, the user should be able to write or convert bytecode to another format.

- All bytecode, except for the floating point constant table, will be written in either 32-bit or 64-bit types, whichever is more efficient for the native platform. All integer values are limited to 32-bit values, which will be written in the lower 32 bits of 64-bit types. Larger values should be converted to BigInt types by the assembler. The floating point constant table will be written in the 64-bit type. This maximizes efficiency for the native format (goal 1), and leaves reading by other platforms efficient (goal 2). Alternately, the user should be able to write or convert bytecode to another format.

- All sections should be 64-bit aligned. This levies minimal overhead on platforms that don't require it, but alleviates major overhead for those that do.

- Floating point constants should be written in the native format.
All constants so encoded must fit within the representable range of all major floating point types. Larger floating point numbers should be converted to BigFloat types by the assembler. This maximizes efficiency for the native format (goal 1), and leaves reading by other platforms efficient (goal 2). Alternately, the user should be able to write or convert bytecode to another format.

- We set a 2 GB hard cap on bytecode files, and define a continuation policy. (Although, personally, if we produce files of that size, *somebody* needs to be shot.)

Other considerations:

Given that bytecode is simply data, many of these solutions overlap with other areas of Parrot and Perl, such as pack and unpack. If implemented, the user can pack and unpack arbitrary data for, or from, an arbitrary platform. Or, if you don't care much for the idea of portable bytecode, think about the reverse situation. Having the ability to address arbitrary data (which is good from a Perl Language perspective) is also what gives us the ability to have portable bytecode at little cost. For platforms that don't fall within the confines of these restrictions, it should still be feasible to accurately reconstruct a file, although it may no longer be efficient.

So where am I?

There is an obvious difference between coming up with an idea and doing the work to make the idea come to fruition. An oft-quoted response to a proposal is "Where's the patch?" - used as both an invitation and an obstacle. Here's what I've done code-wise. This is all independent code, not yet integrated into any Parrot baseline.

- I have code that currently supports endian conversion among the three types for 32-, 64-, 96-, and 128-bit structures, optimized for both 32-bit and 64-bit support. Backfilling 16-bit support should be trivial. I'm still trying to come up with a better interface and implementation, however.

- Bytecode size is simply a matter of doing it.

- Alignment is simply a matter of doing it.
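For alignment, "doing it" is mostly a one-liner: round each section offset up to the next 8-byte boundary when emitting bytecode. A sketch, again in Python for illustration only (the helper name is invented):

```python
def pad_to_64(offset: int) -> int:
    """Round a byte offset up to the next 64-bit (8-byte) boundary."""
    return (offset + 7) & ~7

# A section ending at byte 13 forces the next section to start at 16;
# offsets already on an 8-byte boundary are left alone.
assert pad_to_64(13) == 16
assert pad_to_64(16) == 16
```

The emitter writes zero padding for the skipped bytes, so readers on platforms with strict 64-bit alignment requirements never see a misaligned section, while platforms without the requirement pay at most 7 bytes per section.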
- I have code that currently converts 32-, 64-, 96-, and 128-bit floating point representations among all but the IBM format (for which I have the algorithms on paper, but nowhere to test), optimized for both 32-bit and 64-bit support. Although 96- and 128-bit handling is currently hardcoded specifically for conversions between long doubles on x86 machines and 64-bit processors, I've got alpha code for casting among arbitrary types. (For casting to and from 32-bit floats on machines that have no such type, for instance.) IEEE semantics are *not* supported, and are still a matter for discussion. The implementation of over- and underflow conversion to BigFloat is missing, for obvious reasons. I'm still trying to come up with a better interface and implementation, however.

- File size is simply a matter of doing it.

-- 
Bryan C. Warnock
[EMAIL PROTECTED]