This topic got discussed on the DFDL workgroup call today. An action item
was created to address the issue, however, it was observed that DFDL users
are depending on the transformation capabilities of DFDL now in IBM
products (Their z/TPF DFDL specifically). Not to create non-linearly large
output, but to create computed output the shape of which is not purely
matching the format.

In https://github.com/OpenDFDL/examples/tree/master/pairsTransform there is
an example of such a transform. This transposes an array. Specifically a
hidden group contains a pair of lists which matches the format, but
the non-hidden infoset is a list of pairs which are computed from the pair
of lists. I've asked one of the DFDL Workgroup IBMers to weigh in on
whether this pairsTransform is representative of the capabilities their
users need.

On Wed, Jan 24, 2024 at 6:00 PM Interrante, John A (GE Aerospace, US) <
john.interra...@ge.com> wrote:

> Yes, the subject of DFDL infoset explosion interests me, and I took a look
> at the personal github PR.  I agree with the README's premise that the
> infoset's size should be linearly related to the input data's size by
> O(sn).  A runtime check for forward progress on the problematic DFDL 1.0
> features seems worthwhile implementing.  A schema compile time check would
> be even better if a way is found to formally prove the theoretical O(sn)
> constraint.
>
> -----Original Message-----
> From: Mike Beckerle <mbecke...@apache.org>
> Sent: Wednesday, January 24, 2024 1:55 PM
> To: dev@daffodil.apache.org
> Subject: EXT: Theoretical Interest item - Pull Request with example schema
> that illustrates pathological (super-exponential) DFDL behaviors
>
> I have been concerned for a while that combinations of the features in DFDL
> v1.0 can be used in ways that are just far too powerful for what is needed
> for data format expression.
>
> For example, if the input data to parse is n bits, how large should the
> corresponding infoset be when the data is parsed using a DFDL schema?
> Clearly at least O(n) since there are n bits and each bit could create an
> element in the infoset.
>
> But should it be able to be O(n^2), or even O(2^n) ?
>
> I can think of no use case for this.
>
> Intuition about data formats suggests that the infoset should always be
> linearly related in size to the size of the input data, and that the
> constant factor on that should be relatively small.
> Furthermore, the time taken to parse data should be linearly related to
> the size of the data. I think lots of people are thinking DFDL parsing
> should be implementable in a circuit, and the memory requirements are
> proportional to the data size.
> It doesn't make sense to try to make hardware implementations of DFDL if
> that's not going to be the case.
>
> Turns out, using DFDL v1.0 using dfdl:occursCountKind='expression' and
> dfdl:inputValueCalc, can create an infoset from parsing that is
> *super-exponentially larger than the input data*. Yes, it is finite, and
> bounded to some function of the size of the input data. But we don't even
> have notations for this kind of numeric growth.
>
> The schema I created consumes a 3 bit integer as the data. The size of the
> infoset grows like this:
>
>    - input 000 -> infoset size 1
>    - input 001 -> infoset size 1
>    - input 010 -> infoset size 256
>    - input 011 -> infoset size is over 10,000 elements
>    - ...
>    - input 111 -> infoset size is larger than the number of atoms in the
>    universe.
>
> There are TDML tests for this schema that illustrate this (except that last
> one)
>
> But it is always finite! No recursion is required to do this. With
> infinite space and time available it WILL terminate. I believe the language
> remains theoretically decidable.
>
> The README in this schema project discusses this and other pathological
> cases, and suggests things we can do to restrict DFDL so that this kind of
> pathological behavior is not allowed.
>
> Really we want the largest infoset that can be created from
>
>    - n bits of data
>    - a schema of size s
>
> to be O(sn) in size. I have started an argument in the README about why
> O(sn) is the right target. Basically there are practical schemas that can
> meet O(sn) as a lower bound.
>
> I believe this can be achieved by adding more forward-progress
> requirements to DFDL - all arrays should have to make forward progress when
> parsing, not just unbounded arrays, and reused groups/types should have to
> make forward progress at points of reuse. By doing that, we ensure that to
> make a giant infoset, you have to be consuming a giant number of bits.
>
> Right now this stuff is just living in a personal github repo, but I'll
> give it an official better home either in DFDLSchemas or in daffodil-extra
> or someplace later.
>
> If this is of interest, please take a look at this github pull request
> below.
>
> ---------- Forwarded message ---------
> From: Mike Beckerle <notificati...@github.com>
> Date: Wed, Jan 24, 2024 at 3:50 PM
> Subject: [mbeckerle/superexp] Example schema that illustrates pathological
> DFDL behaviors (PR #1)
> To: mbeckerle/superexp <super...@noreply.github.com>
> Cc: Mike Beckerle <mbecke...@apache.org>, Your activity <
> your_activ...@noreply.github.com>
>
>
> Illustrates DFDL issues where parsing small data can lead to explosively
> larger parse output. Super exponentially larger.
>
> The expressive power of DFDL should be restricted so that this is not
> possible.
>
> See the README.md for discussion of the issue.
> ------------------------------
> You can view, comment on, or merge this pull request online at:
>
>   https://github.com/mbeckerle/superexp/pull/1
> Commit Summary
>
>    - 46dbae6
>    <
> https://github.com/mbeckerle/superexp/pull/1/commits/46dbae6b2dc5acd76d6954109f29cb147d0e68e5
> >
>    First version of this example schema
>
> File Changes
>
> (9 files <https://github.com/mbeckerle/superexp/pull/1/files>)
>
>    - *A* .gitattributes
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-618cd5b83d62060ba3d027e314a21ceaf75d36067ff820db126642944145393e
> >
>    (5)
>    - *M* .gitignore
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-bc37d034bad564583790a46f19d807abfe519c5671395fd494d8cce506c42947
> >
>    (7)
>    - *M* README.md
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5
> >
>    (96)
>    - *A* build.sbt
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-5634c415cd8c8504fdb973a3ed092300b43c4b8fc1e184f7249eb29a55511f91
> >
>    (30)
>    - *A* project/build.properties
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-cbc85b9df7818a480f215028e98821bd1c83adb304a9f03658595c44dddff754
> >
>    (1)
>    - *A* src/exp.dfdl.xsd
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-9751c75e50c64f92bee47a05b0800aa9c5691bcd5c2b1166a1b7dcfa6e5ebc65
> >
>    (84)
>    - *A* src/superexp.dfdl.xsd
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-2c057f270cab45f3f0577abc17a3472f2b86cc934f7bb72f58a911d344613572
> >
>    (72)
>    - *A* test/TestSuperexp.scala
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-c6b466d6536f0f670f97aeca9a516d93581c3c8327aa6b9f9fec6657a5e6dcf4
> >
>    (35)
>    - *A* test/TestSuperexp.tdml
>    <
> https://github.com/mbeckerle/superexp/pull/1/files#diff-f1b4e40b448ec3ab530fab92fdfebb51a705aa9fc69ea118a565ef046b0db8e0
> >
>    (117)
>
> Patch Links:
>
>    - https://github.com/mbeckerle/superexp/pull/1.patch
>    - https://github.com/mbeckerle/superexp/pull/1.diff
>
> —
> Reply to this email directly, view it on GitHub <
> https://github.com/mbeckerle/superexp/pull/1>, or unsubscribe <
> https://github.com/notifications/unsubscribe-auth/AALUDAZF7ZM5XWD72HZ4Z5DYQFX2NAVCNFSM6AAAAABCJM6FHOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGA4TSMBYGEYDAOI
> >
> .
> You are receiving this because you are subscribed to this thread.Message
> ID: <mbeckerle/superexp/pull/1...@github.com>
>

Reply via email to