On 6/12/2011 11:29 AM, Lukas Lueg wrote:

This sort of speculative idea might fit the python-ideas list better.

[Summary: we often need to extract a field or two from a binary record in order to decide whether to toss it or unpack it all and process.]

One solution to this is using two format-strings instead of only one
(e.g. '4s4s i 4s2s2s'): One that unpacks just the filtered fields
(e.g. '8x i 8x') and one that unpacks all the fields except the one
already created by the filter (e.g. '4s4s  4x  4s2s2s'). This solution
works very well and increases throughput by far. It however also
creates complexity in the code as we have to keep track and combine
field-values that came from the filtering-part with the ones unpacked
during inspection-part (we don't want to simply unpack twice).
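
For concreteness, the two-format workaround described above might look roughly like this in Python 3. This is only a sketch: the '<' prefix is an assumption added here so the layout does not depend on native alignment, and process() and its argument names are made up for illustration.

```python
import struct

# Formats taken from the message above; the '<' prefix is an assumption
# added so that the layout is identical on every platform.
FULL = struct.Struct('<4s4s i 4s2s2s')
FILTER = struct.Struct('<8x i 8x')          # only the filter field
REST = struct.Struct('<4s4s 4x 4s2s2s')     # everything except the filter field


def process(record, wanted):
    """Return all six fields if the filter field equals `wanted`, else None."""
    (key,) = FILTER.unpack(record)
    if key != wanted:
        return None                          # tossed without a full unpack
    a, b, c, d, e = REST.unpack(record)
    # Recombining by hand is the bookkeeping the message complains about.
    return (a, b, key, c, d, e)


record = FULL.pack(b'aaaa', b'bbbb', 7, b'cccc', b'dd', b'ee')
```

Calling process(record, 7) returns the full six-field tuple, while process(record, 8) returns None after touching only the integer field.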

With just one or two filter fields and very many other fields, I would just unpack everything, including the filter field. I expect the extra time to do that would be comparable to the extra time needed to combine the two results, and it would certainly make your code simpler. I suspect you could write a function that derives the filter-only format from the full format, given the field numbers to keep.
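
A rough sketch of such a helper follows. The name filter_format() and its parameters are hypothetical, and it assumes the format starts with an explicit '<', '>', '=' or '!' prefix, so that no alignment padding is involved and a skipped field can simply be replaced by pad bytes of the same size.

```python
import re
import struct

# One format token: an optional count followed by a format character.
_TOKEN = re.compile(r'(\d*)([a-zA-Z?])')


def filter_format(fmt, keep):
    """Return a format that unpacks only the field numbers in `keep`.

    Every other field is replaced by 'x' pad bytes of the same size.
    Assumes `fmt` begins with '<', '>', '=' or '!' (no alignment padding).
    """
    prefix, body = fmt[0], fmt[1:]
    if prefix not in '<>=!':
        raise ValueError('an explicit byte-order prefix is assumed')
    out = []
    index = 0                                # running field number
    for count, code in _TOKEN.findall(body):
        n = int(count) if count else 1
        if code == 'x':
            out.append('%dx' % n)            # padding is not a field
        elif code in 'sp':
            # For 's' and 'p' the count is a length, so this is one field.
            out.append('%d%s' % (n, code) if index in keep else '%dx' % n)
            index += 1
        else:
            # For other codes the count is a repeat: n separate fields.
            size = struct.calcsize(prefix + code)
            for _ in range(n):
                out.append(code if index in keep else '%dx' % size)
                index += 1
    return prefix + ' '.join(out)
```

With the full format from the example, filter_format('<4s4s i 4s2s2s', {2}) yields a format that covers the same 20 bytes but unpacks only the integer.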

I'd like to propose an enhancement to the struct module that should
solve this dilemma and ask for your comments.

The function s_unpack_internal() inside _struct.c currently unpacks
all values from the buffer-object passed to it and returns a tuple
holding these values. Instead, the function could create a tuple-like
object that holds a reference to its own Struct-object (which holds
the format) and a copy of the memory it is supposed to unpack. This
object allows access to the unpacked values through the sequence
protocol, basically unpacking the fields if - and only if - accessed
through sq_item (e.g. foo = struct.unpack('2s2s', 'abcd'); foo[0] ==
'ab'). The object can also unpack all fields only once (as all
unpacked objects are immutable, we can hold references to them and
return these instead once known). This approach is possible because
there are no further error conditions inside the unpacking-functions
that we would *have* to deal with at the time .unpack() is called; in
other words: Unpacking can't fail if the format-string's syntax was
correct and can therefore be deferred (while packing can't).
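
The proposal concerns the C function in _struct.c, but the idea can be prototyped in pure Python, which is one way to get it tested outside the standard library. The sketch below is not the proposed implementation: LazyRecord and _field_layout() are hypothetical names, an explicit byte-order prefix is assumed so that field offsets can be computed without alignment rules, and Python 3 bytes are used where the message's example used a Python 2 string.

```python
import re
import struct
from collections.abc import Sequence

_TOKEN = re.compile(r'(\d*)([a-zA-Z?])')


def _field_layout(fmt):
    """Return ([(single-field Struct, offset), ...], total size) for fmt."""
    prefix, body = fmt[0], fmt[1:]
    if prefix not in '<>=!':
        raise ValueError('an explicit byte-order prefix is assumed')
    layout, offset = [], 0
    for count, code in _TOKEN.findall(body):
        n = int(count) if count else 1
        if code == 'x':
            offset += n                      # padding occupies no field
        elif code in 'sp':
            s = struct.Struct('%s%d%s' % (prefix, n, code))
            layout.append((s, offset))
            offset += s.size
        else:
            s = struct.Struct(prefix + code)
            for _ in range(n):
                layout.append((s, offset))
                offset += s.size
    return layout, offset


class LazyRecord(Sequence):
    """Tuple-like view that unpacks a field only when it is first accessed."""

    def __init__(self, fmt, data):
        self._layout, size = _field_layout(fmt)
        if len(data) != size:
            raise struct.error('buffer size does not match the format')
        self._data = bytes(data)             # private copy of the memory
        self._cache = {}                     # field number -> unpacked value

    def __len__(self):
        return len(self._layout)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return tuple(self[i] for i in range(*index.indices(len(self))))
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError('field index out of range')
        if index not in self._cache:
            s, offset = self._layout[index]
            (self._cache[index],) = s.unpack_from(self._data, offset)
        return self._cache[index]
```

For example, foo = LazyRecord('<2s2s', b'abcd') unpacks nothing up front; foo[0] then unpacks and caches only the first field, and tuple(foo) unpacks the rest.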

I understand that this may seem like a single-case-optimization.

Yep.

We
can however assume that most people will benefit from the new behavior
unknowingly while everyone else takes no harm:

I will not assume that without code and timings. I would expect that unpacking one field at a time would take longer than all at once. To me, this is the sort of thing that should be written, listed on PyPI, and tested by multiple users on multiple systems first.
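
A minimal timing harness along these lines could be a starting point. It only compares one full unpack against six single-field unpack_from() calls made from Python, so it measures Python-level call overhead and says nothing definite about a C implementation; the '<' prefix, the field offsets and the function names are assumptions made for this sketch.

```python
import struct
import timeit

FULL = struct.Struct('<4s4s i 4s2s2s')
# One single-field Struct per field, paired with its byte offset.
FIELDS = [
    (struct.Struct('<4s'), 0),
    (struct.Struct('<4s'), 4),
    (struct.Struct('<i'), 8),
    (struct.Struct('<4s'), 12),
    (struct.Struct('<2s'), 16),
    (struct.Struct('<2s'), 18),
]
RECORD = FULL.pack(b'aaaa', b'bbbb', 7, b'cccc', b'dd', b'ee')


def all_at_once():
    return FULL.unpack(RECORD)


def one_at_a_time():
    return tuple(s.unpack_from(RECORD, off)[0] for s, off in FIELDS)


def compare(number=100000):
    """Return the total seconds taken by `number` calls of each variant."""
    return {
        f.__name__: timeit.timeit(f, number=number)
        for f in (all_at_once, one_at_a_time)
    }


if __name__ == '__main__':
    for name, seconds in compare().items():
        print('%-15s %.4f s' % (name, seconds))
```

Results will vary by machine and Python version, which is the reason for asking that several users run it on several systems.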

--
Terry Jan Reedy
