Re: Library for parsing binary structures

2019-03-30 Thread Cameron Simpson

On 30Mar2019 10:29, Paul Moore  wrote:

On Fri, 29 Mar 2019 at 23:21, Cameron Simpson  wrote:


On 27Mar2019 18:41, Paul Moore  wrote:
>I'm looking for a library that lets me parse binary data structures.
>The stdlib struct module is fine for simple structures, but when it
>gets to more complicated cases, you end up doing a lot of the work by
>hand (which isn't that hard, and is generally perfectly viable, but
>I'm feeling lazy ;-))

I wrote my own: cs.binary, available on PyPI. The PyPI page has its
module docs, which I think are ok:

  https://pypi.org/project/cs.binary/


Nice, thanks - that's exactly the sort of pointer I was looking for.
I'll take a look and see how it works for my use case.


I'd be happy to consider adapting some stuff for your use cases; as you 
may imagine it is written to my use cases.


Also, I should point you at the cs.binary.structtuple factory, which
makes a class for those structures trivially defined with a struct
format string. It uses struct for the parse and transcribe steps, so
it should be performant. Here's an example from the cs.iso14496
module:


 PDInfo = structtuple('PDInfo', '>LL', 'rate initial_delay')

which makes a PDInfo class for 2 big endian unsigned longs with .rate 
and .initial_delay attributes.
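
For comparison, the same idea can be approximated with nothing but the
stdlib. This is only a rough sketch and not how cs.binary implements
structtuple: make_struct_tuple, from_bytes and transcribe are names made
up here for illustration.

```python
import struct
from collections import namedtuple

def make_struct_tuple(name, fmt, field_names):
    """Build a namedtuple subclass that parses and writes itself with struct.

    Hypothetical helper for illustration; not the cs.binary API.
    """
    packer = struct.Struct(fmt)
    base = namedtuple(name, field_names)

    class StructTuple(base):
        __slots__ = ()
        size = packer.size  # number of bytes one record occupies

        @classmethod
        def from_bytes(cls, data, offset=0):
            # unpack_from raises struct.error if the buffer is too short
            return cls(*packer.unpack_from(data, offset))

        def transcribe(self):
            # Pack the fields back into their binary form
            return packer.pack(*self)

    StructTuple.__name__ = name
    return StructTuple

PDInfo = make_struct_tuple('PDInfo', '>LL', 'rate initial_delay')
info = PDInfo.from_bytes(b'\x00\x00\x00\x10\x00\x00\x00\x02')
print(info.rate, info.initial_delay)
```

Because both directions go through one precompiled struct.Struct, the
class stays a thin wrapper over the C implementation in the stdlib.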


Cheers,
Cameron Simpson 
--
https://mail.python.org/mailman/listinfo/python-list


Re: Library for parsing binary structures

2019-03-30 Thread Paul Moore
On Fri, 29 Mar 2019 at 23:21, Cameron Simpson  wrote:
>
> On 27Mar2019 18:41, Paul Moore  wrote:
> >I'm looking for a library that lets me parse binary data structures.
> >The stdlib struct module is fine for simple structures, but when it
> >gets to more complicated cases, you end up doing a lot of the work by
> >hand (which isn't that hard, and is generally perfectly viable, but
> >I'm feeling lazy ;-))
>
> I wrote my own: cs.binary, available on PyPI. The PyPI page has its
> module docs, which I think are ok:
>
>   https://pypi.org/project/cs.binary/

Nice, thanks - that's exactly the sort of pointer I was looking for.
I'll take a look and see how it works for my use case.

Paul


Re: Library for parsing binary structures

2019-03-29 Thread Cameron Simpson

On 30Mar2019 09:44, Cameron Simpson  wrote:

On 27Mar2019 18:41, Paul Moore  wrote:

I'm looking for a library that lets me parse binary data structures.
The stdlib struct module is fine for simple structures, but when it
gets to more complicated cases, you end up doing a lot of the work by
hand (which isn't that hard, and is generally perfectly viable, but
I'm feeling lazy ;-))


I wrote my own: cs.binary, available on PyPI. The PyPI page has its
module docs, which I think are ok:


https://pypi.org/project/cs.binary/

[...]

and here's an ISO14496 (the MP4 format) parser using it:
https://pypi.org/project/cs.iso14496/
Of interest is that ISO 14496 uses recursive data structures.


I neglected to mention: with cs.binary you write binary formats as 
classes (which allows for easy conditional parsing and so forth).


And... normally those classes know how to write themselves back out, 
which makes for easy transcription and binary data generation.


Conditional binary formats require a class specific .transcribe method 
(which just yields binary data or some other convenient things including 
other binary class instances, see doco) but flat records have a default 
.transcribe.
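
To make the "formats as classes" idea concrete without leaning on the
cs.binary API, here is a stdlib-only sketch. The record layout (a flags
byte, an optional timestamp, a value) and the names Record, from_bytes
and HAS_TIMESTAMP are invented for illustration; only the notion of a
.transcribe method that yields binary chunks comes from the description
above.

```python
import struct

class Record:
    """A made-up record: flags byte, optional uint32 timestamp, uint16 value.

    The timestamp is present only when bit 0 of the flags byte is set.
    """
    HAS_TIMESTAMP = 0x01

    def __init__(self, value, timestamp=None):
        self.value = value
        self.timestamp = timestamp

    @classmethod
    def from_bytes(cls, data, offset=0):
        """Parse one record; return (record, offset just past it)."""
        (flags,) = struct.unpack_from('>B', data, offset)
        offset += 1
        timestamp = None
        if flags & cls.HAS_TIMESTAMP:
            # Conditional field: only read it when the flag says it exists
            (timestamp,) = struct.unpack_from('>L', data, offset)
            offset += 4
        (value,) = struct.unpack_from('>H', data, offset)
        return cls(value, timestamp), offset + 2

    def transcribe(self):
        """Yield the binary chunks that make up this record."""
        has_ts = self.timestamp is not None
        yield struct.pack('>B', self.HAS_TIMESTAMP if has_ts else 0)
        if has_ts:
            yield struct.pack('>L', self.timestamp)
        yield struct.pack('>H', self.value)

raw = b'\x01\x00\x00\x00\x2a\x00\x07'
rec, end = Record.from_bytes(raw)
assert b''.join(rec.transcribe()) == raw
```

The conditional logic lives in ordinary Python "if" statements, which is
the main attraction of writing the format as a class.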


Cheers,
Cameron Simpson 


Re: Library for parsing binary structures

2019-03-29 Thread Cameron Simpson

On 27Mar2019 18:41, Paul Moore  wrote:

I'm looking for a library that lets me parse binary data structures.
The stdlib struct module is fine for simple structures, but when it
gets to more complicated cases, you end up doing a lot of the work by
hand (which isn't that hard, and is generally perfectly viable, but
I'm feeling lazy ;-))


I wrote my own: cs.binary, available on PyPI. The PyPI page has its
module docs, which I think are ok:


 https://pypi.org/project/cs.binary/

Here's a binary packet protocol built on top of it:

 https://pypi.org/project/cs.packetstream/

and here's an ISO14496 (the MP4 format) parser using it:

 https://pypi.org/project/cs.iso14496/

Of interest is that ISO 14496 uses recursive data structures.

The command line "main" function is up the top, which shows how it is 
used.


Cheers,
Cameron Simpson 


Re: Library for parsing binary structures

2019-03-29 Thread Peter J. Holzer
On 2019-03-29 16:34:35 +, Paul Moore wrote:
> On Fri, 29 Mar 2019 at 16:16, Peter J. Holzer  wrote:
> 
> > Obviously you need some way to describe the specific binary format you
> > want to parse - in other words, a grammar. The library could then use
> > the grammar to parse the input - either by interpreting it directly, or
> > by generating (Python) code from it. The latter has the advantage that
> > it has to be done only once, not every time you want to parse a file.
> >
> > If that sounds familiar, it's what yacc does. Except that it does it for
> > text files, not binary files. I am not aware of any generic binary
> > parser generator for Python. I have read research papers about such
> > generators for (I think) C and Java, but I don't remember the names and
> > I'm not sure if the generators got beyond the proof of concept stage.
> 
> That's precisely what I'm looking at. The construct library
> (https://pypi.org/project/construct/) basically does that, but using a
> DSL implemented in Python rather than generating Python code from a
> grammar.

Good to know. I'll add that to my list of Tools Which I'm Not Likely To
Use Soon But Which May Be Useful Some Day.


> However, although the resulting parser works, it gives horrible error
> messages. This is a normal problem with generated parsers; there are
> plenty of books and articles covering how to persuade tools like yacc
> to produce usable error reports on parse failures.

Yeah, that still seems to be an unsolved problem.

> I don't know which solution I'll ultimately use, but it's an
> interesting exercise doing it both ways. And parsing binary data,
> unlike parsing text, is actually easy enough that hand crafting a
> parser isn't that much of a bother - maybe that's why there's less
> existing work in this area.

I'm a bit sceptical about that. Writing a hand-crafted parser for most
text-based grammars isn't that hard either, but there are readily-
available tools (like yacc), so people use them (despite problems like
horrible error messages). For binary protocols, such tools are much less
well-known. It may be true that binary grammars seem simpler. But in
practice there are lots and lots of security holes because hand-crafted
parsers tend to use un-warranted shortcuts (see heart-bleed or the JPEG
parsing bug of the week), which an automatically generated parser would
not take.
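
The shortcut in question is typically trusting a length field taken
from the input. As a minimal sketch of the check a careful parser makes,
assuming a made-up layout of a big-endian uint16 length followed by that
many payload bytes (ParseError and read_length_prefixed are hypothetical
names):

```python
import struct

class ParseError(ValueError):
    """Raised when the input does not match the expected layout."""

def read_length_prefixed(data, offset=0):
    """Read a uint16 length followed by that many payload bytes.

    Returns (payload, new_offset). The declared length is checked
    against the data actually present before it is used.
    """
    if len(data) - offset < 2:
        raise ParseError(f"offset {offset}: truncated length field")
    (length,) = struct.unpack_from('>H', data, offset)
    start = offset + 2
    end = start + length
    if end > len(data):
        raise ParseError(
            f"offset {offset}: declared length {length} exceeds "
            f"the {len(data) - start} bytes available")
    return data[start:end], end
```

In Python an unchecked slice would quietly return short data rather
than read past the buffer, but the parser would still carry on with a
record that is not what the sender claimed it to be.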

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 




Re: Library for parsing binary structures

2019-03-29 Thread Paul Moore
On Fri, 29 Mar 2019 at 16:16, Peter J. Holzer  wrote:

> Obviously you need some way to describe the specific binary format you
> want to parse - in other words, a grammar. The library could then use
> the grammar to parse the input - either by interpreting it directly, or
> by generating (Python) code from it. The latter has the advantage that
> it has to be done only once, not every time you want to parse a file.
>
> If that sounds familiar, it's what yacc does. Except that it does it for
> text files, not binary files. I am not aware of any generic binary
> parser generator for Python. I have read research papers about such
> generators for (I think) C and Java, but I don't remember the names and
> I'm not sure if the generators got beyond the proof of concept stage.

That's precisely what I'm looking at. The construct library
(https://pypi.org/project/construct/) basically does that, but using a
DSL implemented in Python rather than generating Python code from a
grammar. In fact, the problem I had with my recursive data structure
turned out to be solvable in construct - as the DSL effectively builds
a data structure describing the grammar, I was able to convert the
problem of writing a recursive grammar into one of writing a recursive
data structure:

type_layouts = {}
layout1 = 
layout2 = 
type_layouts[1] = layout1
type_layouts[2] = layout2
data_layout = 
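
The actual construct layouts did not survive in the snippet above, so
the following is not a reconstruction of them. It is a stdlib-only
sketch of the same trick: describe the layouts as data in a table and
let the list layout refer back to that table. The type codes (1, 2, 3),
the tuple encoding of layouts and the function names are all invented
for illustration.

```python
import struct

# Layouts are plain data. ('prim', fmt) is a fixed struct format;
# ('list',) means: element type byte, uint32 count, then count elements.
type_layouts = {}
type_layouts[1] = ('prim', '>B')   # byte
type_layouts[2] = ('prim', '>l')   # 32-bit signed integer
type_layouts[3] = ('list',)        # refers back to type_layouts when parsed

def parse_value(type_code, data, offset):
    """Parse one value of the given type; return (value, new_offset)."""
    layout = type_layouts[type_code]
    if layout[0] == 'prim':
        fmt = layout[1]
        (value,) = struct.unpack_from(fmt, data, offset)
        return value, offset + struct.calcsize(fmt)
    # list: the element type is looked up in the same table, so lists of
    # lists need no special handling
    elem_type, count = struct.unpack_from('>BL', data, offset)
    offset += 5
    items = []
    for _ in range(count):
        item, offset = parse_value(elem_type, data, offset)
        items.append(item)
    return items, offset

def parse_item(data, offset=0):
    """Parse a type byte followed by the data for that type."""
    return parse_value(data[offset], data, offset + 1)

# A list containing two lists of bytes: [[10, 20], [30]]
inner1 = bytes([1]) + struct.pack('>L', 2) + bytes([10, 20])
inner2 = bytes([1]) + struct.pack('>L', 1) + bytes([30])
sample = bytes([3, 3]) + struct.pack('>L', 2) + inner1 + inner2
print(parse_item(sample))
```

The recursion is in the data, not the grammar: the list entry only
names other entries in type_layouts, including itself.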

However, although the resulting parser works, it gives horrible error
messages. This is a normal problem with generated parsers; there are
plenty of books and articles covering how to persuade tools like yacc
to produce usable error reports on parse failures. There don't seem to
be any particularly good error reporting features in construct
(although I haven't looked closely), so I'm actually now looking at
writing a hand-crafted parser, just to control the error reporting[1].
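
One way a hand-crafted parser can control its error reporting is to
track the byte offset and the name of the field being read, and put both
into the exception. The sketch below is an assumption about what that
might look like, not anything taken from construct; Reader,
BinaryParseError and the field names are hypothetical.

```python
import struct

class BinaryParseError(Exception):
    """Parse failure carrying the byte offset and the field being read."""
    def __init__(self, message, offset, path):
        self.offset = offset
        self.path = '.'.join(path) or '<top>'
        super().__init__(f"{message} at offset {offset} in {self.path}")

class Reader:
    def __init__(self, data):
        self.data = data
        self.offset = 0
        self.path = []   # names of the fields currently being parsed

    def unpack(self, name, fmt):
        """Unpack one value, naming it so that failures say what broke."""
        self.path.append(name)
        try:
            (value,) = struct.unpack_from(fmt, self.data, self.offset)
        except struct.error as e:
            raise BinaryParseError(str(e), self.offset, self.path) from e
        self.offset += struct.calcsize(fmt)
        self.path.pop()
        return value

def parse_header(reader):
    # 'header', 'version' and 'length' are made-up field names
    reader.path.append('header')
    version = reader.unpack('version', '>B')
    length = reader.unpack('length', '>L')
    reader.path.pop()
    return version, length
```

With truncated input such as b'\x01\x00\x00', parse_header raises a
BinaryParseError whose message names offset 1 and the field
header.length, which is usually enough to find the bug, whether it is in
the data or in the parsing code.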

I don't know which solution I'll ultimately use, but it's an
interesting exercise doing it both ways. And parsing binary data,
unlike parsing text, is actually easy enough that hand crafting a
parser isn't that much of a bother - maybe that's why there's less
existing work in this area.

Paul

[1] The errors I'm reporting on are likely to be errors in my parsing
code at this point, rather than errors in the data, but the problem is
pretty much the same either way ;-)


Re: Library for parsing binary structures

2019-03-29 Thread Dan Sommers

On 3/29/19 12:13 PM, Peter J. Holzer wrote:


Obviously you need some way to describe the specific binary format you
want to parse - in other words, a grammar. The library could then use
the grammar to parse the input - either by interpreting it directly, or
by generating (Python) code from it. The latter has the advantage that
it has to be done only once, not every time you want to parse a file.

If that sounds familiar, it's what yacc does. Except that it does it for
text files, not binary files. I am not aware of any generic binary
parser generator for Python. I have read research papers about such
generators for (I think) C and Java, but I don't remember the names and
I'm not sure if the generators got beyond the proof of concept stage.


It's been a while since I've used those tools, but if you
create a lexer (the yylex() function) that can tokenize a
binary stream, then yacc won't know the difference.
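
As a sketch of what such a lexer could look like in Python, here is a
generator that turns a binary stream into (kind, value) tokens. The
token kinds and their one-byte codes are invented, and the sketch
assumes every value carries its own code, which is a simplification: in
a format where list elements share one type byte, the lexer would need
feedback from the parser.

```python
import io
import struct

# Token kinds and their one-byte codes are invented for this sketch.
BYTE, INT, LIST_START = 'BYTE', 'INT', 'LIST_START'
CODES = {0x01: (BYTE, '>B'), 0x02: (INT, '>l'), 0x03: (LIST_START, '>L')}

def tokenize(stream):
    """Yield (kind, value) tokens from a binary stream, as yylex() would.

    For LIST_START the value is the element count; deciding what the
    following tokens mean is left to the parser.
    """
    while True:
        code = stream.read(1)
        if not code:
            return  # clean end of input
        try:
            kind, fmt = CODES[code[0]]
        except KeyError:
            raise ValueError(f"unknown token code {code[0]:#04x}") from None
        size = struct.calcsize(fmt)
        raw = stream.read(size)
        if len(raw) != size:
            raise ValueError(f"truncated {kind} token")
        yield kind, struct.unpack(fmt, raw)[0]

stream = io.BytesIO(b'\x03\x00\x00\x00\x02\x01\x0a\x02\xff\xff\xff\xff')
tokens = list(tokenize(stream))
print(tokens)
```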


Re: Library for parsing binary structures

2019-03-29 Thread Peter J. Holzer
On 2019-03-28 11:07:22 +0100, dieter wrote:
> Paul Moore  writes:
> > My real interest is in whether any libraries exist to do this sort
> > of thing (there are plenty of parser libraries for text, pyparsing
> > being the obvious one, but far fewer for binary structures).
> 
> Sure. *BUT* the library must fit your specific binary structure.
> How should a general library know how to interpret your
> specific "type byte"s or that "(" introduces a homogeneous
> sequence of given length which must be terminated by ")"?

Obviously you need some way to describe the specific binary format you
want to parse - in other words, a grammar. The library could then use
the grammar to parse the input - either by interpreting it directly, or
by generating (Python) code from it. The latter has the advantage that
it has to be done only once, not every time you want to parse a file.

If that sounds familiar, it's what yacc does. Except that it does it for
text files, not binary files. I am not aware of any generic binary
parser generator for Python. I have read research papers about such
generators for (I think) C and Java, but I don't remember the names and
I'm not sure if the generators got beyond the proof of concept stage.

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 




Re: Library for parsing binary structures

2019-03-28 Thread dieter
Paul Moore  writes:

> On Thu, 28 Mar 2019 at 08:15, dieter  wrote:
> ...
> My real interest is in whether any
> libraries exist to do this sort of thing (there are plenty of parser
> libraries for text, pyparsing being the obvious one, but far fewer for
> binary structures).

Sure. *BUT* the library must fit your specific binary structure.
How should a general library know how to interpret your
specific "type byte"s or that "(" introduces a homogeneous
sequence of given length which must be terminated by ")"?

On the other hand, if those specifics are known, then
the rest is trivial (as shown in my previous message).


If the binary structure is not fixed (i.e.
you deserialize only things you yourself have serialized), then you
can use Python's "pickle" (and likely also "marshal").
It supports the structuring you need and (among
others) the Python elementary types.
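
For the case where you control both ends, a round trip through pickle
is a few lines. The nested value below is only an example; note that
unpickling executes instructions from the data, so this is for input you
produced yourself and trust.

```python
import pickle

# Nested lists of elementary types, serialized and read back
value = [1, b'\x02', [3, [4, 5]], 'text']
blob = pickle.dumps(value)
restored = pickle.loads(blob)
assert restored == value
```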


There is also "asn1" (--> "https://pypi.org/project/asn1/")
for ASN.1 (BER/DER) binary formats.
ASN.1 is a widely used very flexible language to describe structured data
(which typically has a binary encoding) -- used e.g.
by LDAP and X.509. It supports (among others) an extremely rich set of
elementary types and structuring via "Sequence", "Set" and "Choice".

The elementary binary format is "tag value".
This is near to your "type_byte value". However, "tag" is
not a byte. Instead, it consists of a number (identifying the
type within its class), a class and an encoding indication.
This more general type specification is necessary as
in the general case, a byte is not sufficient to identify all
possible relevant types.
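
To illustrate the three parts of a tag, here is a sketch that splits a
single BER identifier octet by hand. It does not use the "asn1" package;
decode_identifier is a made-up helper and it handles only the common
low-tag-number form, where the tag number fits in the low five bits.

```python
CLASS_NAMES = ('universal', 'application', 'context-specific', 'private')

def decode_identifier(octet):
    """Split one BER identifier octet into (class, constructed, number).

    Handles only the low-tag-number form (tag numbers 0 to 30); a number
    field of 31 means the tag continues in the following octets.
    """
    tag_class = CLASS_NAMES[octet >> 6]      # top two bits
    constructed = bool(octet & 0x20)         # encoding indication
    number = octet & 0x1F                    # low five bits
    if number == 0x1F:
        raise NotImplementedError("high-tag-number form not handled here")
    return tag_class, constructed, number

print(decode_identifier(0x30))  # identifier octet of a SEQUENCE
print(decode_identifier(0x02))  # identifier octet of an INTEGER
```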



Re: Library for parsing binary structures

2019-03-28 Thread Paul Moore
On Thu, 28 Mar 2019 at 08:15, dieter  wrote:
> What you have is a generalized deserialization problem.
> It can be solved with a set of deserializers.

Yes, and thanks for the suggested code structure. As I say, I can
certainly do the parsing "by hand", and the way you describe is very
similar to how I'd approach that. My real interest is in whether any
libraries exist to do this sort of thing (there are plenty of parser
libraries for text, pyparsing being the obvious one, but far fewer for
binary structures).

Paul


Re: Library for parsing binary structures

2019-03-28 Thread dieter
Paul Moore  writes:
> I'm looking for a library that lets me parse binary data structures.
> The stdlib struct module is fine for simple structures, but when it
> gets to more complicated cases, you end up doing a lot of the work by
> hand (which isn't that hard, and is generally perfectly viable, but
> I'm feeling lazy ;-))
>
> I know of Construct, which is a nice declarative language, but it's
> either weak, or very badly documented, when it comes to recursive
> structures. (I really like Construct, and if I could only understand
> the docs better I may well not need to look any further, but as it is,
> I can't see anything showing how to do recursive structures...) I am
> specifically trying to parse a structure that looks something like the
> following:
>
> Multiple instances of:
>   - a type byte
>   - a chunk of data structured based on the type
> types include primitives like byte, integer, etc, as well as
> (type byte, count, data) - data is "count" occurrences of data of
> the given type.

What you have is a generalized deserialization problem.
It can be solved with a set of deserializers.

def deserialize(file):
  """read the beginning of file and return the corresponding object."""

In the above case, you have a mapping "type byte --> deserializer",
called "TYPE" and (obviously) "(" is one such "type byte".

The deserializer corresponding to "(" is:
def sequence_deserialize(file):
  type_byte = file.read(1)
  if not type_byte: raise EOFError()
  type = TYPE[type_byte]
  count = TYPE[INT].deserialize(file)
  seq = [type.deserialize(file) for i in range(count)]
  assert file.read(1) == b")"
  return seq

The top level "deserialize" could look like:
def top_deserialize(file):
  """generates all values found in *file*."""
  while True:
    type_byte = file.read(1)
    if not type_byte: return
    yield TYPE[type_byte].deserialize(file)
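
Filled out into something runnable, the outline above might look like
the sketch below. The TYPE table maps type bytes to plain functions
rather than objects with a .deserialize method, and the type bytes b'b'
and b'i', the register decorator and the read_exact helper are
assumptions added for illustration; only "(" and ")" come from the
outline.

```python
import io
import struct

TYPE = {}  # type byte -> deserializer function taking a binary file

def register(type_byte):
    """Decorator that records a deserializer under its type byte."""
    def decorator(func):
        TYPE[type_byte] = func
        return func
    return decorator

def read_exact(file, n):
    """Read exactly n bytes or raise EOFError."""
    data = file.read(n)
    if len(data) != n:
        raise EOFError(f"wanted {n} bytes, got {len(data)}")
    return data

@register(b'b')
def byte_deserialize(file):
    return read_exact(file, 1)[0]

@register(b'i')
def int_deserialize(file):
    return struct.unpack('>l', read_exact(file, 4))[0]

@register(b'(')
def sequence_deserialize(file):
    # element type byte, count, the elements, then a closing ")"
    element = TYPE[read_exact(file, 1)]
    count = int_deserialize(file)
    seq = [element(file) for _ in range(count)]
    if read_exact(file, 1) != b')':
        raise ValueError("sequence not terminated by ')'")
    return seq

def top_deserialize(file):
    """Generate all values found in *file*."""
    while True:
        type_byte = file.read(1)
        if not type_byte:
            return
        yield TYPE[type_byte](file)

data = (b'b\x07' + b'(i' + struct.pack('>l', 2)
        + struct.pack('>ll', 10, -1) + b')')
print(list(top_deserialize(io.BytesIO(data))))
```

Since the sequence deserializer looks its element type up in the same
table, a sequence of sequences works without extra code.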




Library for parsing binary structures

2019-03-27 Thread Paul Moore
I'm looking for a library that lets me parse binary data structures.
The stdlib struct module is fine for simple structures, but when it
gets to more complicated cases, you end up doing a lot of the work by
hand (which isn't that hard, and is generally perfectly viable, but
I'm feeling lazy ;-))

I know of Construct, which is a nice declarative language, but it's
either weak, or very badly documented, when it comes to recursive
structures. (I really like Construct, and if I could only understand
the docs better I may well not need to look any further, but as it is,
I can't see anything showing how to do recursive structures...) I am
specifically trying to parse a structure that looks something like the
following:

Multiple instances of:
  - a type byte
  - a chunk of data structured based on the type
types include primitives like byte, integer, etc, as well as
(type byte, count, data) - data is "count" occurrences of data of
the given type.

That last one is a list, and yes, you can have lists of lists, so the
structure is recursive.

Does anyone know of any other binary data parsing libraries, that can
handle recursive structures reasonably cleanly?

I'm already *way* past the point where it would have been quicker for
me to write the parsing code by hand rather than trying to find a
"quick way", so the question is honestly mostly about finding out what
people recommend for jobs like this rather than actually needing
something specific to this problem. But I do keep hitting the need to
parse binary structures, and having something in my toolbox for the
future would be really nice.

Paul