I will try to respond to all comments at once.

But first a clarification:
   - I am not trying to design a high-level API on top of the existing
     struct.unpack, and
   - I am not trying to design a replacement for struct.unpack.
     (If I were to replace struct.unpack(), then I would probably go along
     the lines of the StructReader suggested by Raymond.)

I view the struct module as a low-level (un)packing library on top of which
more complex stuff can be built, and I am simply suggesting a way to
improve this low-level functionality...


> > We could have an optional offset argument for
> >
> > unpack(format, buffer, offset=None)
> >
> > the offset argument is an object which contains a single integer field
> > which gets incremented inside unpack() to point to the next byte.

> As for "passing offset implies the length is calcsize(fmt)" sub-concept,
> I find that slightly more controversial.  It's convenient,
> but somewhat ambiguous; in other cases (e.g. string methods) passing a
> start/offset and no end/length means to go to the end.

I am not sure I agree: in most cases a starting offset and no
length/end just means "start whatever you are doing at this offset and
stop whenever you are happy".

At least that's the way I have always thought about functions like
string.find() and friends.

The suggested struct.unpack() change seems to fit this mental model very well.

>> the offset argument is an object which contains a single integer field
>> which gets incremented inside unpack() to point to the next byte.

> I find this just too "magical".

Why would it be magical? There is no guessing of user intentions involved.
The function simply returns/uses an extra piece of information if the user
asks for it, and the function already computes this piece of
information anyway.
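
To make this concrete, here is a rough pure-Python sketch of the behaviour
I have in mind (the Offset class and the unpack_at() name are just
placeholders for illustration; the real change would of course live inside
the C module):

    import struct

    class Offset:
        """Hypothetical mutable offset holder: a single integer field."""
        def __init__(self, value=0):
            self.value = value

    def unpack_at(fmt, buf, offset=None):
        """Emulate the proposed struct.unpack(fmt, buf, offset) in pure Python.

        Without an offset this is plain struct.unpack(); with an Offset it
        unpacks calcsize(fmt) bytes starting at offset.value, leaves any
        trailing bytes alone, and advances the offset past the bytes read.
        """
        if offset is None:
            return struct.unpack(fmt, buf)
        size = struct.calcsize(fmt)
        fields = struct.unpack(fmt, buf[offset.value:offset.value + size])
        offset.value += size
        return fields

so that the header-plus-items example from my original message reads:

    offset = Offset(0)
    hdr = unpack_at(hdr_fmt, rec, offset)        # header first...
    for i in range(hdr[0]):
        item = unpack_at(item_fmt, rec, offset)  # ...then items, back to back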


> It's only useful when you're specifically unpacking data bytes that are
>  compactly back to back (no "filler" e.g. for alignment purposes)

Yes, but it's a very common case when dealing with binary file formats.

E.g. I just looked at the xdrlib.py code, and it seems that almost every
invocation of struct.unpack would shrink from 3 lines to 1 line of code

(        i = self.__pos
        self.__pos = j = i+4
        data = self.__buf[i:j]
        return struct.unpack('>l', data)[0]

would become:
        return struct.unpack('>l', self.__buf, self.__pos)[0]
)


There are probably other places in the stdlib which would benefit from this
API, even though the stdlib does not deal with binary files that much.

>and pays some conceptual price -- introducing a new specialized type
> to play the role of "mutable int"

But the user does not have to pay anything if he does not need it! The
change is backward compatible. (Note that supporting just plain int offsets
would eliminate the slicing, but it would not eliminate the other
annoyances; it is possible to support both Offset and int args, but is it
worth the hassle?)
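
To illustrate: with a plain int offset, the record loop from my original
message (quoted below) would lose the slicing but keep the manual
bookkeeping (the three-argument int form here is hypothetical):

    offset = hdr_size
    for i in range(hdr[0]):
        item = struct.unpack(item_fmt, rec, offset)   # no slicing, but...
        offset += item_size                           # ...bookkeeping remains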

> and having an argument mutated, which is not usual in Python's library.

Actually, it's so common that we simply stop noticing it :-)
E.g. when we call a superclass's method:
  SuperClass.__init__(self)

So, while I agree that there is an element of unusualness in the
suggested unpack() API, this element seems pretty small to me.

> All in all, I suspect that something like.
> hdrsize = struct.calcsize(hdr_fmt)
> itemsize = struct.calcsize(item_fmt)
> reclen = length_of_each_record
> rec = binfile.read(reclen)
> hdr = struct.unpack(hdr_fmt, rec, 0, hdrsize)
>     for offs in itertools.islice(xrange(hdrsize, reclen, itemsize), hdr[0]):
>         item = struct.unpack(item_fmt, rec, offs, itemsize)
>         # process item
>might be a better compromise

I think I again disagree: your example is almost as verbose as the current
unpack() API, you still need to call calcsize() explicitly, and I
don't think there is any chance of gaining any noticeable
performance benefit. Too little gain to bother with any changes...
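
To put the two forms side by side (both signatures are hypothetical):

    # Alex's compromise: explicit offset and size, calcsize() still required
    item = struct.unpack(item_fmt, rec, offs, itemsize)

    # Offset object: no calcsize(), and the offset advances by itself
    item = struct.unpack(item_fmt, rec, offset)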


> struct.pack/struct.unpack is already one of my least-favourite parts
> of the stdlib.  Of the modules I use regularly, I pretty much only ever
> have to go back and re-read the struct (and re) documentation because
> they just won't fit in my brain. Adding additional complexity to them
> seems like a net loss to me.

A net loss to the end programmer? But if he does not need the new
functionality, he does not have to use it! In fact, I started by providing
an example of how the new API makes client code simpler.


> I'd much rather specify the format as something like a tuple of values -
>    (INT, UINT, INT, STRING) (where INT &c are objects defined in the
>    struct module). This also then allows users to specify their own formats
>    if they have a particular need for something

I don't disagree, but I think it's orthogonal to the offset issue.


Ilya




On Thu, 6 Jan 2005, Raymond Hettinger wrote:

> [Ilya Sandler]
> > A problem:
> >
> > The current struct.unpack api works well for unpacking C-structures where
> > everything is usually unpacked at once, but it
> > becomes  inconvenient when unpacking binary files where things
> > often have to be unpacked field by field. Then one has to keep track
> > of offsets, slice the strings,call struct.calcsize(), etc...
>
> Yes.  That bites.
>
>
> > Eg. with a current api unpacking  of a record which consists of a
> > header followed by a variable  number of items would go like this
> >
> >  hdr_fmt="iiii"
> >  item_fmt="IIII"
> >  item_size=calcsize(item_fmt)
> >  hdr_size=calcsize(hdr_fmt)
> >  hdr=unpack(hdr_fmt, rec[0:hdr_size]) #rec is the record to unpack
> >  offset=hdr_size
> >  for i in range(hdr[0]): #assume 1st field of header is a counter
> >    item=unpack( item_fmt, rec[ offset: offset+item_size])
> >    offset+=item_size
> >
> > which is quite inconvenient...
> >
> >
> > A  solution:
> >
> > We could have an optional offset argument for
> >
> > unpack(format, buffer, offset=None)
> >
> > the offset argument is an object which contains a single integer field
> > which gets incremented inside unpack() to point to the next byte.
> >
> > so with a new API the above code could be written as
> >
> >  offset=struct.Offset(0)
> >  hdr=unpack("iiii", rec, offset)
> >  for i in range(hdr[0]):
> >     item=unpack( "IIII", rec, offset)
> >
> > When an offset argument is provided, unpack() should allow some bytes to
> > be left unpacked at the end of the buffer..
> >
> >
> > Does this suggestion make sense? Any better ideas?
>
> Rather than alter struct.unpack(), I suggest making a separate class
> that tracks the offset and encapsulates some of the logic that typically
> surrounds unpacking:
>
>     r = StructReader(rec)
>     hdr = r('iiii')
>     for item in r.getgroups('IIII', times=hdr[0]):
>        . . .
>
> It would be especially nice if it handled the more complex case where
> the next offset is determined in-part by the data being read (see the
> example in section 11.3 of the tutorial):
>
>     r = StructReader(open('myfile.zip', 'rb'))
>     for i in range(3):                  # show the first 3 file headers
>         fields = r.getgroup('LLLHH', offset=14)
>         crc32, comp_size, uncomp_size, filenamesize, extra_size = fields
>         filename = r.getgroup('c', offset=16, times=filenamesize)
>         extra = r.getgroup('c', times=extra_size)
>         r.advance(comp_size)
>         print filename, hex(crc32), comp_size, uncomp_size
>
> If you come up with something, I suggest posting it as an ASPN recipe
> and then announcing it on comp.lang.python.  That ought to generate some
> good feedback based on other people's real world issues with
> struct.unpack().
>
>
> Raymond Hettinger
>
>