For the past week or so, I've been working on design and code in this area. While I wait for answers to my last couple of questions, I'm throwing this out there. This is mostly FYI, partly RFC, and not a FIX.
Portability:

There are four levels of potential portability and portability obstacles.

- The Parrot source code needs to be portable across all platforms that we wish to support. This is achieved through standards adherence, major configuration checks, and, as little as possible, parallel code branches. This level of portability is required, for obvious reasons.

- The Parrot bytecode needs to be parsable across multiple platforms. Strings need to be identifiable as strings, numbers as numbers, and opcodes as the opcodes they are meant to represent. This is the level of portability that I am addressing.

- The Parrot bytecode needs to execute consistently across multiple platforms. This means an 'add' opcode needs to add everywhere, and not, say, get the hostname. This is where portability breaks down, because the functionality that Parrot opcodes encode is itself not truly portable. Where possible, functionality should be emulated; where not, documented. Creating a symbolic link is an example.

- Parrot data needs to be consistent across multiple platforms. This is portability in user space, and there's little we can do about it. 'open file' is only portable provided 'file' actually exists on all systems.

Goals:

As stated above, I only wish to address the second layer - the ability of Parrot to parse bytecode, regardless of the platforms involved.

- Parrot bytecode should be most efficient when compiled for and run on the native platform. Perl's number one priority is always speed where it counts.

- Parrot bytecode should be efficiently compilable for, or runnable on, the majority of supported systems.

- Parrot bytecode should be compilable for, or runnable on, all platforms that Parrot supports.

- These goals should be addressed simultaneously.

Obstacles:

There are currently five obstacles to portable bytecode.

- Endianness. The three major types are big, little, and Vaxian. Supporting these three should handle the majority of cases.

- Supported sizes.
The current design assumes at least 32 bits. Most platforms have a 32-bit type, and some have 64-bit support. Some 64-bit machines support 32 bits, and others do not.

- Alignment. Along the same lines as above, 64-bit machines sometimes have 64-bit alignment requirements.

- Floating point representations. The four major types are IEEE(ish), Vaxian, Cray's CRI, and the IBM/370 hexadecimal format. There are some minor variations among these, particularly in how much of the IEEE-754 standard the floating point operations adhere to. However, adherence falls more into Portability Layer Three; here we address representation alone.

- File size limits. Some systems are limited to files of a certain size.

Solutions:

Each of these obstacles is solvable, some in more than one way.

- As long as the required size is known, it is trivial to convert endianness from one form to another. Although this is most efficient when the item to be transformed is properly word-aligned and fits within a native integer type, it is possible to transform data of arbitrary types.

- Requiring data representation to be either 32 or 64 bits maximizes bytecode portability, while maintaining native efficiency and minimizing the number of sizes to support. Limiting internal data itself (such as opcode numbers) to 32 bits ensures valid downgrading of 64-bit formatted data.

- Aligning structures of 64 bits or greater on 64-bit boundaries ensures the most efficient boundary conditions for all supported platforms.

- The majority of platforms can handle some form of 64-bit floating point number. All of the formats are easily interchangeable, although some IEEE semantics are not. (To ensure Level Three portability, those may need to be virtualized at that level.) Three of the four types share similar ranges - the Cray being the one exception. Data limitation (similar to the above) will ensure proper portability between native types.
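As a sketch of the first solution above - size-known endianness conversion - here is how a word-wise byte swap between big- and little-endian might look. Python is used purely for illustration; the function name and interface are invented here, this is not Parrot code, and the VAX mixed ordering would need an additional half-word swap on top of this.

```python
import struct

def swap_endianness(data: bytes, width: int) -> bytes:
    """Reverse the byte order of each fixed-width word in `data`.

    `width` is the word size in bytes (4 for 32-bit words, 8 for 64-bit).
    Assumes len(data) is a multiple of `width` - i.e., the size is known.
    """
    return b"".join(data[i:i + width][::-1]
                    for i in range(0, len(data), width))

# A 32-bit opcode stream written big-endian...
be = struct.pack(">3I", 1, 2, 0x0A0B0C0D)
# ...reads back correctly as little-endian after the word-wise swap.
le = swap_endianness(be, 4)
assert struct.unpack("<3I", le) == (1, 2, 0x0A0B0C0D)
```

Because the swap only needs the word width, the same routine covers integers, offsets, and (representation aside) floating point words alike, which is what makes "as long as the required size is known" the operative constraint.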
An alternate solution would be to encode floating point numbers as strings, and reconvert them on parsing.

- We can set an arbitrary limit on produced bytecode. Given that we are producing our own file formats, we should be capable of producing multi-part bytecode files on machines that don't support file sizes greater than some arbitrary size - nominally, 2 GB. This limitation would be independent of providing large file support for user data files.

Proposal:

For background, revisit my proposed Bytecode Format (v2) at http:[EMAIL PROTECTED]/msg05640.html. Although it is outdated, it gives a general gist of the direction of my thinking. In particular, pay no heed to the incremental, relative addressing of each section. By capping bytecode to an arbitrary size, we should be able to do direct indexing.

- All bytecode is by default written in native endianness. This maximizes efficiency for the native format (goal 1), and leaves reading by other platforms efficient (goal 2). Alternately, the user should be able to write or convert bytecode to another format.

- All bytecode, except for the floating point constant table, will be written in either 32-bit or 64-bit types, whichever is more efficient for the native platform. All integer values are limited to 32-bit values, which will be written in the lower 32 bits of 64-bit types. Larger values should be converted to BigInt types by the assembler. The floating point constant table will be written in the 64-bit type. This maximizes efficiency for the native format (goal 1), and leaves reading by other platforms efficient (goal 2). Alternately, the user should be able to write or convert bytecode to another format.

- All sections should be 64-bit aligned. This levies minimal overhead on platforms that don't require it, but alleviates major overhead for those that do.

- Floating point constants should be written in the native format.
All constants so encoded must fit within the representable range of all major floating point types. Larger floating point numbers should be converted to BigFloat types by the assembler. This maximizes efficiency for the native format (goal 1), and leaves reading by other platforms efficient (goal 2). Alternately, the user should be able to write or convert bytecode to another format.

- We set a 2 GB hard cap on bytecode files, and define a continuation policy. (Although, personally, if we produce files of that size, *somebody* needs to be shot.)

Other considerations:

Given that bytecode is simply data, many of these solutions overlap with other areas of Parrot and Perl, such as pack and unpack. If implemented, the user can pack and unpack arbitrary data for, or from, an arbitrary platform. Or, if you don't care much for the idea of portable bytecode, think about the reverse situation. Having the ability to address arbitrary data (which is good from a Perl Language perspective) is also what gives us the ability to have portable bytecode at little cost. For platforms that don't fall within the confines of these restrictions, it should still be feasible to accurately reconstruct a file, although it may no longer be efficient.

So where am I?

There is an obvious difference between coming up with an idea and doing the work to make the idea come to fruition. An oft-quoted response to a proposal is "Where's the patch?" - used as both an invitation and an obstacle. Here's what I've done code-wise. This is all independent code, not yet integrated into any Parrot baseline.

- I have code that currently supports endian conversion among the three types for 32-, 64-, 96-, and 128-bit structures, optimized for both 32-bit and 64-bit support. Backfilling 16-bit support should be trivial. I'm still trying to come up with a better interface and implementation, however.

- Bytecode size is simply a matter of doing it.

- Alignment is simply a matter of doing it.
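For alignment, "doing it" is mostly a one-liner: round each section offset up to the next 8-byte boundary when emitting bytecode. A sketch, again in Python for illustration only (the helper name is invented):

```python
def pad_to_64(offset: int) -> int:
    """Round a byte offset up to the next 64-bit (8-byte) boundary."""
    return (offset + 7) & ~7

# A section ending at byte 13 forces the next section to start at 16;
# offsets already on an 8-byte boundary are left alone.
assert pad_to_64(13) == 16
assert pad_to_64(16) == 16
```

The emitter writes zero padding for the skipped bytes, so readers on platforms with strict 64-bit alignment requirements never see a misaligned section, while platforms without the requirement pay at most 7 bytes per section.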
- I have code that currently converts 32-, 64-, 96-, and 128-bit floating point representations among all but the IBM format (for which I have the algorithms on paper, but nowhere to test), optimized for both 32-bit and 64-bit support. Although 96- and 128-bit handling is currently hardcoded specifically for conversions between long doubles on x86 machines and 64-bit processors, I've got alpha code for casting among arbitrary types. (For casting to and from 32-bit floats on machines that have no such type, for instance.) IEEE semantics are *not* supported, and are still a matter for discussion. The implementation of over- and underflow conversion to BigFloat is missing, for obvious reasons. I'm still trying to come up with a better interface and implementation, however.

- File size is simply a matter of doing it.

-- 
Bryan C. Warnock
[EMAIL PROTECTED]