Glenn, et al.

I'm going to be combining a number of different comments in here.

Glenn Linderman wrote:
> I was surprised by the read/write operations, but have no objection to them.
> New/get/set and the individual data member access functions are the critical
> pieces, as the I/O could be done to normal variables, but it would take more
> steps that way, so read/write are nice enhancements.

I felt that the read/write enhancements were necessary.  Without them,
programmers must calculate the required length of the reads
themselves.  While they may know the length of the fixed elements, they
will not know the length of the variable elements, at least not without
pre-reading the fixed elements that provide that length.

Requiring them to do so would defeat one of the goals: ease of use.
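To make the point concrete, here is a minimal sketch in Python, using its 'struct' module as an analogue to pack/unpack (the function name and layout are hypothetical).  The caller cannot know how many bytes to read until the fixed header has been parsed, which is exactly what a read helper would hide:

```python
import io
import struct

def read_counted_record(stream):
    """Read a record whose variable part's length is stored in a
    fixed-size header.  Without a helper like this, every caller
    must perform this two-step read by hand."""
    # Fixed part: one 32-bit little-endian count.
    (count,) = struct.unpack("<I", stream.read(4))
    # Variable part: 'count' bytes of payload, unknown until now.
    payload = stream.read(count)
    return count, payload

buf = io.BytesIO(struct.pack("<I", 5) + b"hello")
count, payload = read_counted_record(buf)
```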

>    [ 'bar' => 'i', 'baz' => 'i', 'count' => 'i' ]

It is my understanding that "=>" indicates the link in a hash between a
key and its value.  Since hashes do not preserve element order, and we
must maintain the order of the elements, this nomenclature would appear
to be inappropriate.

Still, I like the idea of making the relationship more explicit; we
could do it this way:

        [ ['bar','int32'], ['baz','int32'], ['count','int32'] ]

As you can see, I agree with longer data type names.  However, in
response to a number of other postings to the language group, I'm going
to suggest that we adopt a different format.
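The ordered list-of-pairs form above can be sketched in Python (standing in for the eventual Perl, with a hypothetical 4-byte 'int32'); unlike a hash, the list fixes the declaration order, and the byte offset of every field follows directly from it:

```python
# A structure definition as an ordered list of (name, type) pairs,
# mirroring [ ['bar','int32'], ['baz','int32'], ['count','int32'] ].
layout = [("bar", "int32"), ("baz", "int32"), ("count", "int32")]

# Because the list preserves order, the on-disk offsets are stable.
offsets = {}
pos = 0
for name, typ in layout:
    offsets[name] = pos
    pos += 4  # int32 is 4 bytes in this sketch
```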

PLEA FOR INFO!  Can anyone point me at a better/more detailed
explanation of the existing 'pack/unpack' format characters?  The one in
'perlfunc' leaves me wondering what a number of the markers are for, or
exactly how they work.

Consider 'p' and 'P'.  They supposedly refer to a pointer to either a
string or a structure, but the documentation does not say whether the
pointer is relative to the first byte of the pointer, the first byte
after the pointer, or the first byte of the overall structure.

Basic Object Type:

'short' - whatever your computer understands is a short.
'long' - ditto.
'int' - ditto.
'double' - ditto.
'float' - ditto.
'char' - ditto, with the understanding that Unicode makes these
         characters larger than you would expect.
'byte' - whatever your computer understands is a byte.  (Yes, there
         are some systems that don't use 8 bits to a byte.  Not many,
         but they are there.)
'octet' - specifically 8 bits, regardless of the byte size of your
system.
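To show what "whatever your computer understands" means in practice, here is a hypothetical mapping from the proposed names onto Python 'struct' codes (the table and function are illustrative assumptions, not part of the proposal):

```python
import struct

# Hypothetical mapping from the proposed type names to native
# struct codes; 'octet' is deliberately absent, since it is a fixed
# 8 bits rather than a native size.
NATIVE = {
    "short": "h", "int": "i", "long": "l",
    "float": "f", "double": "d", "byte": "b",
}

def native_size(name):
    """Size in bytes of a basic object on this machine."""
    return struct.calcsize(NATIVE[name])
```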

To which we may append a number of different modifiers:

Endianness:

/d - default, whatever your system uses normally
/n - network
/b - big
/l - little
/v - vax

(??Is VAX sufficiently different to require its own modifier??  Or is
this to do with bit order?)

Signedness:

/u - unsigned
/s - signed

We may also allow these modifiers in the definition of the structure
itself, so that the entire structure is affected without having to
explicitly modify each variable.
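Python's 'struct' already works this way for endianness, which suggests the idea is sound: one byte-order prefix on the format string modifies every field in the structure, instead of marking each field individually (a sketch by analogy, not the proposed syntax):

```python
import struct

# One structure-wide modifier ('>' or '<') versus per-field markers:
# the prefix applies to both ints at once.
big = struct.pack(">ii", 1, 2)     # whole structure big-endian
little = struct.pack("<ii", 1, 2)  # whole structure little-endian
```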

Non-Standard sizes:

/[0-9]+ - The basic object is this many bits long.

I strongly suggest that we settle on bits as the standard of length,
since if we start mixing bits versus bytes on the basic elements, we're
going to confuse the living daylights out of people.
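A bits-only length standard is straightforward to honour at unpack time.  Here is a minimal sketch (hypothetical helper, big-endian bit order assumed) of pulling an arbitrary-width unsigned integer out of a byte string at a bit offset:

```python
def read_bits(data, bit_offset, nbits):
    """Extract an unsigned integer of nbits bits starting at
    bit_offset from a bytes object, most significant bit first."""
    value = 0
    for i in range(nbits):
        pos = bit_offset + i
        byte = data[pos // 8]
        bit = (byte >> (7 - pos % 8)) & 1
        value = (value << 1) | bit
    return value
```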

Definition of arrays:

[<length>] - The number of basic objects, as modified.

So, an array of 20 10-bit unsigned network order integers would be:

[ ['myarray','int/10/u/n[20]'] ]

Alternatively:

[ ['myarray','int/un10[20]'] ]

I know that looks complex, but I believe that the vast majority of cases
would likely be handled by the simplest forms, such as:

[ ['myfixedstring', 'char[30]'] ]

Which is an array of 30 Unicode characters.  (Of course, if the decision
is made that we do not use Unicode, or any other multi-lingual support,
then the specialness of 'char' versus 'byte' goes away.) (As you might
guess, I'm torn between human readable, easily parsed, and easy to
type.  I may have gone too far towards 'easily parsed'.)
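On the 'easily parsed' side, the notation really is cheap to take apart.  A hypothetical parser for strings like 'int/10/u/n[20]' (base type, slash-separated modifiers, optional [length] array suffix) fits in a few lines:

```python
import re

# base name, zero or more /modifiers, optional [count] suffix
TYPE_RE = re.compile(r"^(\w+)((?:/\w+)*)(?:\[(\d+)\])?$")

def parse_type(spec):
    """Split a type spec into (base, modifiers, array_count)."""
    m = TYPE_RE.match(spec)
    if m is None:
        raise ValueError("bad type spec: " + spec)
    base, mods, count = m.groups()
    modifiers = mods.split("/")[1:] if mods else []
    return base, modifiers, int(count) if count else None
```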

> 4) allow hooks to support non-standard types.
>   sub Structure::define ( <type name>, <frombinarysub>, <tobinarysub> )
> 
>   sub from_funny ( <type_params>, <binary_var>, <bit_offset> )
>   # returns ( <next_bit_offset>, <funny_var> )
> 
>   sub to_funny ( <type_params>, <binary_var>, <bit_offset>, <funny_var> )
>   # returns ( <next_bit_offset> )

This combination presumes that the data has already been loaded into an
internal variable.  Given variable length data, this is not a valid
assumption.

I considered adding a 'length' subroutine to the spec, which the main
pack code could call to determine if it needed to read any more data,
but realized that the 'length' subroutine itself might need to read more
data before it could determine the total length needed for 'from_funny'
to succeed.

Thus, 'from_funny' must either return a special code indicating that it
needs more data to work, with some indication of how much more to read;
or it must perform its own reading, and therefore accept a file handle
to read from.

I'm leaning towards the first case, because we may have received our
data as a string, not a file handle.  In that event, there is nowhere
to read additional data from.
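The 'needs more data' convention could look like the following sketch, for a byte-counted string (all names here are hypothetical; the (next_bit_offset, value) shape follows the from_funny signature quoted above):

```python
NEED_MORE = object()  # sentinel: "call me again with more data"

def from_counted(params, data, bit_offset):
    """Parse one length byte followed by that many payload bytes.
    Returns (next_bit_offset, value) on success, or
    (NEED_MORE, extra_bits_wanted) when the buffer is too short."""
    byte_off = bit_offset // 8
    if len(data) < byte_off + 1:
        return NEED_MORE, 8                 # need the length byte itself
    length = data[byte_off]
    if len(data) < byte_off + 1 + length:
        missing = byte_off + 1 + length - len(data)
        return NEED_MORE, missing * 8       # need the rest of the string
    value = data[byte_off + 1 : byte_off + 1 + length]
    return bit_offset + (1 + length) * 8, value
```

Note that the caller, not from_counted, decides where the extra data comes from, which is what makes this work for string input as well as file handles.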

>   Structure::define ( 'funny', \&from_funny, \&to_funny );
> 
>   [ 'var1' => 'int32', 'var2' => ['funny', 6, 18, 12]]
> 
>   <type_params> is the reference to the array defined in the definition, in this
> example, it would be ['funny_type', 6, 18, 12]  (it appears that the type
> 'funny' has three parameters to define its storage characteristics).

        The format for 'funny' would turn into this:

        [ ['var1','int/32'], ['var2','funny/6/18/12'] ]

> (The handling of indefinite length arrays specified by earlier data.)

> >         [ 'bar', 'i', 'baz', 'i', 'count', 'i',
> >           repeat( 'count', [ 'length', 'i', 'offset', 'i' ] ) ]
> 
> Or  ['bar' => 'int32', 'baz' => 'int32', 'count' => 'int32',
>            'struct_2' => ['array', 'count', [ 'length' => 'int32', 'offset' =>
> 'int32' ]]]

Or [ ['bar','int/32'],['baz','int/32'],['count','int/32'],
        ['mystruct',
                ['array','count',[['length','int/32'],['offset','int/32']]]]]

(I think I have the right number of close brackets in there!)

The above presumes that 32 bit integers are not standard for this
system.  If they are, it becomes even simpler.

[['bar','int'],['baz','int'],['count','int'],['myarray',
        ['array','count',[['length','int'],['offset','int']]]]]

(Is it just me, or is this beginning to look like rectangular lisp?)
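Rectangular or not, the nesting is easy to walk mechanically.  A hypothetical Python sketch over the definition above (treating each entry as [name, type], where type is either a string or an ['array', count_field, sub_definition] triple):

```python
def field_names(definition, prefix=""):
    """Collect dotted field names from a nested definition."""
    names = []
    for name, typ in definition:
        if isinstance(typ, list) and typ[0] == "array":
            # repeated sub-structure; its count comes from an
            # earlier field named by typ[1]
            names.extend(field_names(typ[2], prefix + name + "."))
        else:
            names.append(prefix + name)
    return names

layout = [["bar", "int"], ["baz", "int"], ["count", "int"],
          ["myarray", ["array", "count",
                       [["length", "int"], ["offset", "int"]]]]]
```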

> >         Okay, that looks like it might work, now add in the strings
> >         referenced by length and offset.  [Ideas anyone?]
> 
> OK, here's a (Forth or Pascal or BASIC) counted string:

I think you missed my intent.  The number of strings is the same as the
number of length/offset pairs.  The length of each string is determined
by the corresponding 'length' value from the pair.  The position of each
string in the data following the already parsed portion is determined by
the 'offset' value from the pair.

We would need some indication of where the 'offset' starts from: the
beginning of this structure, the first byte of the area _following_ the
array of length/offset pairs, etc.
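To show why the base matters, here is a sketch that resolves the strings assuming one of those choices: offsets relative to the first byte following the array of pairs (the function and the little-endian int32 pairs are illustrative assumptions):

```python
import struct

def extract_strings(data, pair_count):
    """Read pair_count (length, offset) int32 pairs, then slice out
    each string, with offsets based at the byte after the pairs."""
    pairs = []
    pos = 0
    for _ in range(pair_count):
        length, offset = struct.unpack_from("<ii", data, pos)
        pairs.append((length, offset))
        pos += 8
    base = pos  # first byte following the array of pairs
    return [data[base + off : base + off + ln] for ln, off in pairs]
```

Pick a different base (say, the start of the whole structure) and every slice moves, which is why the spec has to pin this down.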

I've chosen not to comment further, as we have enough differences at
this point to resolve. (And I'd like to get _something_ out the door!)

So, what'cha'think?

E.W.
