Re: Theoretical Interest item - Pull Request with example schema that illustrates pathological (super-exponential) DFDL behaviors

Mike Beckerle Wed, 07 Feb 2024 10:59:49 -0800

This example of the super-exponential explosion has been moved to:
https://github.com/OpenDFDL/examples/tree/master/superexp



On Wed, Jan 24, 2024 at 6:00 PM Interrante, John A (GE Aerospace, US) <
[email protected]> wrote:

> Yes, the subject of DFDL infoset explosion interests me, and I took a look
> at the personal github PR.  I agree with the README's premise that the
> infoset's size should be linearly related to the input data's size by
> O(sn).  A runtime check for forward progress on the problematic DFDL 1.0
> features seems worthwhile implementing.  A schema compile time check would
> be even better if a way is found to formally prove the theoretical O(sn)
> constraint.
>
> -----Original Message-----
> From: Mike Beckerle <[email protected]>
> Sent: Wednesday, January 24, 2024 1:55 PM
> To: [email protected]
> Subject: EXT: Theoretical Interest item - Pull Request with example schema
> that illustrates pathological (super-exponential) DFDL behaviors
>
> I have been concerned for a while that combinations of the features in DFDL
> v1.0 can be used in ways that are just far too powerful for what is needed
> for data format expression.
>
> For example, if the input data to parse is n bits, how large should the
> corresponding infoset be when the data is parsed using a DFDL schema?
> Clearly at least O(n) since there are n bits and each bit could create an
> element in the infoset.
>
> But should it be able to be O(n^2), or even O(2^n) ?
>
> I can think of no use case for this.
>
> Intuition about data formats suggests that the infoset should always be
> linearly related in size to the size of the input data, and that the
> constant factor on that should be relatively small.
> Furthermore, the time taken to parse data should be linearly related to
> the size of the data. I think lots of people are thinking DFDL parsing
> should be implementable in a circuit, and the memory requirements are
> proportional to the data size.
> It doesn't make sense to try to make hardware implementations of DFDL if
> that's not going to be the case.
>
> Turns out, using DFDL v1.0 using dfdl:occursCountKind='expression' and
> dfdl:inputValueCalc, can create an infoset from parsing that is
> *super-exponentially larger than the input data*. Yes, it is finite, and
> bounded to some function of the size of the input data. But we don't even
> have notations for this kind of numeric growth.
>
> The schema I created consumes a 3 bit integer as the data. The size of the
> infoset grows like this:
>
>    - input 000 -> infoset size 1
>    - input 001 -> infoset size 1
>    - input 010 -> infoset size 256
>    - input 011 -> infoset size is over 10,000 elements
>    - ...
>    - input 111 -> infoset size is larger than the number of atoms in the
>    universe.
>
> There are TDML tests for this schema that illustrate this (except that last
> one)
>
> But it is always finite! No recursion is required to do this. With
> infinite space and time available it WILL terminate. I believe the language
> remains theoretically decidable.
>
> The README in this schema project discusses this and other pathological
> cases, and suggests things we can do to restrict DFDL so that this kind of
> pathological behavior is not allowed.
>
> Really we want the largest infoset that can be created from
>
>    - n bits of data
>    - a schema of size s
>
> to be O(sn) in size. I have started an argument in the README about why
> O(sn) is the right target. Basically there are practical schemas that can
> meet O(sn) as a lower bound.
>
> I believe this can be achieved by adding more forward-progress
> requirements to DFDL - all arrays should have to make forward progress when
> parsing, not just unbounded arrays, and reused groups/types should have to
> make forward progress at points of reuse. By doing that, we ensure that to
> make a giant infoset, you have to be consuming a giant number of bits.
>
> Right now this stuff is just living in a personal github repo, but I'll
> give it an official better home either in DFDLSchemas or in daffodil-extra
> or someplace later.
>
> If this is of interest, please take a look at this github pull request
> below.
>
> ---------- Forwarded message ---------
> From: Mike Beckerle <[email protected]>
> Date: Wed, Jan 24, 2024 at 3:50 PM
> Subject: [mbeckerle/superexp] Example schema that illustrates pathological
> DFDL behaviors (PR #1)
> To: mbeckerle/superexp <[email protected]>
> Cc: Mike Beckerle <[email protected]>, Your activity <
> [email protected]>
>
>
> Illustrates DFDL issues where parsing small data can lead to explosively
> larger parse output. Super exponentially larger.
>
> The expressive power of DFDL should be restricted so that this is not
> possible.
>
> See the README.md for discussion of the issue.
> ------------------------------
> You can view, comment on, or merge this pull request online at:
>
>   https://github.com/mbeckerle/superexp/pull/1
> Commit Summary
>
>    - 46dbae6
>    <
> https://github.com/mbeckerle/superexp/pull/1/commits/46dbae6b2dc5acd76d6954109f29cb147d0e68e5
> >
>    First version of this example schema
>
> File Changes
>
> (9 files <https://github.com/mbeckerle/superexp/pull/1/files>)
>
>    - *A* .gitattributes
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-618cd5b83d62060ba3d027e314a21ceaf75d36067ff820db126642944145393e
> >
>    (5)
>    - *M* .gitignore
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-bc37d034bad564583790a46f19d807abfe519c5671395fd494d8cce506c42947
> >
>    (7)
>    - *M* README.md
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5
> >
>    (96)
>    - *A* build.sbt
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-5634c415cd8c8504fdb973a3ed092300b43c4b8fc1e184f7249eb29a55511f91
> >
>    (30)
>    - *A* project/build.properties
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-cbc85b9df7818a480f215028e98821bd1c83adb304a9f03658595c44dddff754
> >
>    (1)
>    - *A* src/exp.dfdl.xsd
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-9751c75e50c64f92bee47a05b0800aa9c5691bcd5c2b1166a1b7dcfa6e5ebc65
> >
>    (84)
>    - *A* src/superexp.dfdl.xsd
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-2c057f270cab45f3f0577abc17a3472f2b86cc934f7bb72f58a911d344613572
> >
>    (72)
>    - *A* test/TestSuperexp.scala
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-c6b466d6536f0f670f97aeca9a516d93581c3c8327aa6b9f9fec6657a5e6dcf4
> >
>    (35)
>    - *A* test/TestSuperexp.tdml
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-f1b4e40b448ec3ab530fab92fdfebb51a705aa9fc69ea118a565ef046b0db8e0
> >
>    (117)
>
> Patch Links:
>
>    - https://github.com/mbeckerle/superexp/pull/1.patch
>    - https://github.com/mbeckerle/superexp/pull/1.diff
>
> —
> Reply to this email directly, view it on GitHub <
> https://github.com/mbeckerle/superexp/pull/1>, or unsubscribe <
> https://github.com/notifications/unsubscribe-auth/AALUDAZF7ZM5XWD72HZ4Z5DYQFX2NAVCNFSM6AAAAABCJM6FHOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGA4TSMBYGEYDAOI
> >
> .
> You are receiving this because you are subscribed to this thread.Message
> ID: <mbeckerle/superexp/pull/[email protected]>
>

Re: Theoretical Interest item - Pull Request with example schema that illustrates pathological (super-exponential) DFDL behaviors

Reply via email to