Hi Amy,

Re: ASN.1 ECN, I think differentiating your Kaitai system from it would be
worthwhile. I find ECN impenetrable but Kaitai intuitive, so it would
help me a lot if you compared them. :-)

Re: DFDL examples, there are numerous DFDL schemas, simple and complex, at
https://github.com/DFDLSchemas. These are not cybersecurity-specific;
they're just schemas for the formats. You should be able to convert, or
even compile, these DFDL schemas into your Kaitai format descriptions.
In fact the whole point of pursuing ISO standardization for DFDL (which
finally came through this month) was to make the effort of crafting such
tools worthwhile and durable in value.

You may also find this of interest as a non-cybersecurity example:
https://cloud.google.com/blog/products/application-modernization/dfdl-processing-with-google-cloud
which describes a data normalization service some Googlers created using
Apache Daffodil.

In addition, I almost have DFDL working as part of Apache Drill, enabling
SQL queries against any data that has a DFDL description, and joins
against any of the other data types Apache Drill supports. I plan for this
to be complete this summer.

Also https://github.com/OpenDFDL has examples as well.

And... since you brought up the Turing-complete vs. non-Turing-complete
conundrum, one example of interest is
https://github.com/OpenDFDL/examples/tree/master/superexp

So DFDL is deliberately not a Turing-complete language. There are no
unbounded loops (due to a forward-progress requirement) and no recursion is
allowed, so a parse must always terminate.

But.... it is capable of creating a parse result that is
(super?)-exponentially larger than the input, which is _almost_ as bad as
being Turing complete. For a DFDL schema of size S and input data of size
N, the parse result can be O(N^(2^S)). Such schemas are pathological of
course, and for any fixed schema this is still polynomial in N, since the
exponent 2^S is then a constant.
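As a toy illustration of the growth (plain Python, not actual DFDL; the
function and its mechanics are made up for this sketch): imagine each of
the schema's S nested layers using a computed occurs-count that squares the
element count produced by the layer below it, emitting computed elements
that consume no further input.

```python
def parse_result_size(n, s):
    """Toy model of the blowup: the input contributes a count n, and
    each of the schema's s layers squares the element count (think:
    a computed occurs-count expression), so the parse result ends up
    with n**(2**s) elements without consuming additional input."""
    count = n
    for _ in range(s):
        count = count * count
    return count

print(parse_result_size(2, 3))  # 2**(2**3) = 256
```

Even tiny inputs and small schemas produce enormous results this way,
which is why such schemas count as pathological.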

We are looking to add practical restrictions to ensure that for data of
size N, the parse result from a schema of size S can only be of size
O(SN), i.e., truly linear. Doing that while retaining the ability to
compute necessary things like lengths, totals, and counts required by the
format is the language-design challenge, though for practical purposes
there are numerous ad-hoc solutions involving adding more forward-progress
requirements when parsing at run time. We'd really like a
language-theoretic, schema-compilation-time way to guarantee this O(SN)
behavior.
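A run-time version of that forward-progress requirement could look like
the following sketch (plain Python, not DFDL; the names and parser shape
are invented for illustration): if every array element must consume at
least one byte, the element count per array is capped at N.

```python
def parse_array(data, parse_element):
    """Run parse_element repeatedly, enforcing forward progress:
    every element must consume at least one byte, so the array can
    hold at most len(data) elements -- output linear in the input."""
    pos, elements = 0, []
    while pos < len(data):
        value, new_pos = parse_element(data, pos)
        if new_pos <= pos:
            raise ValueError("element consumed no input; aborting parse")
        elements.append(value)
        pos = new_pos
    return elements

# Example element parser: one byte per element.
print(parse_array(b"abc", lambda d, p: (d[p], p + 1)))  # [97, 98, 99]
```

The catch, as noted above, is that this is an ad-hoc run-time check;
computed lengths, totals, and counts that consume no input fall outside
what such a guard can bound, which is exactly where the compile-time
guarantee is wanted.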

This whole "exponential, but not Turing complete" thing is actually a
complexity class that is fairly interesting from a CS perspective. System
F (https://en.wikipedia.org/wiki/System_F) is a lambda-calculus-style
language from CS theory which is also not Turing complete but can compute
_almost_ forever on small inputs. I believe that DFDL is in this
complexity class because of the ability to embed arrays (not recursively,
but nested to some finite depth given the size of the schema) whose
lengths are induced from the size N of the data. The question is whether
there are languages where O(SN) can be guaranteed that are still
practically useful.
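To make that nesting point concrete (again a hedged sketch, not DFDL): with
arrays nested to a fixed depth d, where each level's length is induced
from a value n found in the data, the flattened result has n**d leaves.

```python
def leaf_count(n, depth):
    """Arrays nested to a fixed depth (set by the schema), each
    level's length induced from a data-derived value n: the result
    has n**depth leaves. No recursion, no unbounded loop -- the
    polynomial's degree is fixed by the schema, not by the data."""
    total = 1
    for _ in range(depth):
        total *= n
    return total

print(leaf_count(10, 3))  # 1000 leaves from a 3-deep nesting
```

So the data sets the base and the schema sets the exponent, which is why
the blowup is bounded for any fixed schema yet unbounded across schemas.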

Well, that may have been TL;DR, but you hit on one of my hot topics.

-mike beckerle





On Fri, Apr 19, 2024 at 3:36 PM Roberts, Amy L2 <[email protected]>
wrote:

> Hi Mike,
>
> The wikipedia description is pretty complete in terms of what we're
> looking to include in the paper.  I am interested in including an example
> workflow and possibly implementation gotchas, especially those that appear
> across tools.  For example, one of the questions that has come up for
> Kaitai is the dangers of making the description language Turing-complete
> vs. the benefits.
>
> The other thing that would be helpful would be more specific (where
> possible) examples.  I'm not sure how feasible that is if most of the use
> cases are cybersecurity-related, but if anyone does have examples
> they're able to share, that would be compelling!
>
> When you say it would be helpful to compare and contrast approaches, do
> you mean DFDL and ECN?
>
> Thanks so much for your response!
>
> Amy
>
> ------------------------------
> *From:* Mike Beckerle <[email protected]>
> *Sent:* Friday, April 19, 2024 10:09 AM
> *To:* [email protected] <[email protected]>
> *Subject:* Re: paper on tools and use cases for parsing binary data
>
> Hi Amy,
>
> What would you be looking for beyond what is already in our Wikipedia
> description: https://en.wikipedia.org/wiki/Data_Format_Description_Language ?
>
> One thing I would emphasize about DFDL is that most people think data
> parsing is the hard problem, and data serialization is relatively simple,
> but if the serializer actually solves the problem of computing all the
> stored lengths for you, so that the user doesn't have to know anything
> about the actual format, then data serialization turns out to be far more
> complex than parsing. Particularly if you want streaming behavior. I gave a
> talk about this back at the ApacheCon conference in 2018. The slides are
> available here: https://s.apache.org/apacheconNA2018-dfdl
>
> To date, the primary use case for DFDL is cybersecurity-related. Data must
> be parsed/validated and "unparsed" back to original form in order to
> ensure that the data is, in fact, in that format and will not crash
> applications. The threat is not so much malware as "just bad data" causing
> denial-of-service.
>
> For your study, I would suggest you also look at ASN.1 Encoding Control
> Notation, which has been an ISO standard since 2008. We normally think of
> ASN.1 as prescribing the data format, but ECN extends it so that you can
> specify the representation of the data yourself. See:
> https://en.wikipedia.org/wiki/Encoding_Control_Notation
>
> I think it would be very helpful if a paper really compared/contrasted
> these approaches.
>
>
> On Fri, Apr 19, 2024 at 10:24 AM Roberts, Amy L2 <[email protected]>
> wrote:
>
> Hello!
>
> I am working with a team on a tool, Awkward Kaitai, that lets people
> work with binary data once that data has been described with a custom
> language.  If you're interested in more details, the project is
> currently hosted at
> https://github.com/ManasviGoyal/kaitai_struct_awkward_runtime and is
> meant to integrate with a larger project, https://kaitai.io/.
>
> I am writing because DFDL is a tool that solves a similar problem.
>
> We are currently writing a paper that provides examples of different
> custom-data problems in different domains and an overview of tools that
> help scientists work with such data. I wanted to reach out to the DFDL
> community to see if anyone would be interested in joining our paper as
> an author.
>
> I'd be delighted to have you contribute in any way you'd like as an
> author, and am particularly interested in having you:
>
> - Contribute a section about your tool
> - Show how your tool deals with a toy data file (I'm suggesting
> https://github.com/det-lab/dataReaderWriter/blob/master/kaitai/ksy/animal.ksy
> but would be happy to consider other options!)
> - Help identify any similar tools that we should include in our review
> - Help identify any use cases that we could include in our "Use Cases"
> section
>
> Thanks so much for your work in this area!
>
> Best,
>
> Amy
>
> Amy Roberts
>
> Assistant Professor of Physics
>
> [email protected]
>
>
