RE: Theoretical Interest item - Pull Request with example schema that illustrates pathological (super-exponential) DFDL behaviors

Interrante, John A (GE Aerospace, US) Wed, 24 Jan 2024 15:00:26 -0800

Yes, the subject of DFDL infoset explosion interests me, and I took a look at 
the personal github PR.  I agree with the README's premise that the infoset's 
size should be linearly related to the input data's size by O(sn).  A runtime 
check for forward progress on the problematic DFDL 1.0 features seems 
worthwhile implementing.  A schema compile time check would be even better if a 
way is found to formally prove the theoretical O(sn) constraint.

-----Original Message-----
From: Mike Beckerle <[email protected]> 
Sent: Wednesday, January 24, 2024 1:55 PM
To: [email protected]
Subject: EXT: Theoretical Interest item - Pull Request with example schema that 
illustrates pathological (super-exponential) DFDL behaviors

I have been concerned for a while that combinations of the features in DFDL
v1.0 can be used in ways that are just far too powerful for what is needed for 
data format expression.

For example, if the input data to parse is n bits, how large should the 
corresponding infoset be when the data is parsed using a DFDL schema?
Clearly at least O(n) since there are n bits and each bit could create an 
element in the infoset.

But should it be able to be O(n^2), or even O(2^n) ?

I can think of no use case for this.

Intuition about data formats suggests that the infoset should always be 
linearly related in size to the size of the input data, and that the constant 
factor on that should be relatively small.
Furthermore, the time taken to parse data should be linearly related to the 
size of the data. I think lots of people are thinking DFDL parsing should be 
implementable in a circuit, and the memory requirements are proportional to the 
data size.
It doesn't make sense to try to make hardware implementations of DFDL if that's 
not going to be the case.

Turns out, using DFDL v1.0 using dfdl:occursCountKind='expression' and 
dfdl:inputValueCalc, can create an infoset from parsing that is 
*super-exponentially larger than the input data*. Yes, it is finite, and 
bounded to some function of the size of the input data. But we don't even have 
notations for this kind of numeric growth.

The schema I created consumes a 3 bit integer as the data. The size of the 
infoset grows like this:

   - input 000 -> infoset size 1
   - input 001 -> infoset size 1
   - input 010 -> infoset size 256
   - input 011 -> infoset size is over 10,000 elements
   - ...
   - input 111 -> infoset size is larger than the number of atoms in the
   universe.

There are TDML tests for this schema that illustrate this (except that last
one)

But it is always finite! No recursion is required to do this. With infinite 
space and time available it WILL terminate. I believe the language remains 
theoretically decidable.

The README in this schema project discusses this and other pathological cases, 
and suggests things we can do to restrict DFDL so that this kind of 
pathological behavior is not allowed.

Really we want the largest infoset that can be created from

   - n bits of data
   - a schema of size s

to be O(sn) in size. I have started an argument in the README about why
O(sn) is the right target. Basically there are practical schemas that can meet 
O(sn) as a lower bound.

I believe this can be achieved by adding more forward-progress requirements to 
DFDL - all arrays should have to make forward progress when parsing, not just 
unbounded arrays, and reused groups/types should have to make forward progress 
at points of reuse. By doing that, we ensure that to make a giant infoset, you 
have to be consuming a giant number of bits.

Right now this stuff is just living in a personal github repo, but I'll give it 
an official better home either in DFDLSchemas or in daffodil-extra or someplace 
later.

If this is of interest, please take a look at this github pull request below.

---------- Forwarded message ---------
From: Mike Beckerle <[email protected]>
Date: Wed, Jan 24, 2024 at 3:50 PM
Subject: [mbeckerle/superexp] Example schema that illustrates pathological DFDL 
behaviors (PR #1)
To: mbeckerle/superexp <[email protected]>
Cc: Mike Beckerle <[email protected]>, Your activity < 
[email protected]>

Illustrates DFDL issues where parsing small data can lead to explosively larger 
parse output. Super exponentially larger.

The expressive power of DFDL should be restricted so that this is not possible.

See the README.md for discussion of the issue.
------------------------------
You can view, comment on, or merge this pull request online at:

  https://github.com/mbeckerle/superexp/pull/1
Commit Summary

   - 46dbae6

<https://github.com/mbeckerle/superexp/pull/1/commits/46dbae6b2dc5acd76d6954109f29cb147d0e68e5>
   First version of this example schema

File Changes

(9 files <https://github.com/mbeckerle/superexp/pull/1/files>)

   - *A* .gitattributes

<https://github.com/mbeckerle/superexp/pull/1/files#diff-618cd5b83d62060ba3d027e314a21ceaf75d36067ff820db126642944145393e>
   (5)
   - *M* .gitignore

<https://github.com/mbeckerle/superexp/pull/1/files#diff-bc37d034bad564583790a46f19d807abfe519c5671395fd494d8cce506c42947>
   (7)
   - *M* README.md

<https://github.com/mbeckerle/superexp/pull/1/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5>
   (96)
   - *A* build.sbt

<https://github.com/mbeckerle/superexp/pull/1/files#diff-5634c415cd8c8504fdb973a3ed092300b43c4b8fc1e184f7249eb29a55511f91>
   (30)
   - *A* project/build.properties

<https://github.com/mbeckerle/superexp/pull/1/files#diff-cbc85b9df7818a480f215028e98821bd1c83adb304a9f03658595c44dddff754>
   (1)
   - *A* src/exp.dfdl.xsd

<https://github.com/mbeckerle/superexp/pull/1/files#diff-9751c75e50c64f92bee47a05b0800aa9c5691bcd5c2b1166a1b7dcfa6e5ebc65>
   (84)
   - *A* src/superexp.dfdl.xsd

<https://github.com/mbeckerle/superexp/pull/1/files#diff-2c057f270cab45f3f0577abc17a3472f2b86cc934f7bb72f58a911d344613572>
   (72)
   - *A* test/TestSuperexp.scala

<https://github.com/mbeckerle/superexp/pull/1/files#diff-c6b466d6536f0f670f97aeca9a516d93581c3c8327aa6b9f9fec6657a5e6dcf4>
   (35)
   - *A* test/TestSuperexp.tdml

<https://github.com/mbeckerle/superexp/pull/1/files#diff-f1b4e40b448ec3ab530fab92fdfebb51a705aa9fc69ea118a565ef046b0db8e0>
   (117)

Patch Links:

   - https://github.com/mbeckerle/superexp/pull/1.patch
   - https://github.com/mbeckerle/superexp/pull/1.diff

—
Reply to this email directly, view it on GitHub 
<https://github.com/mbeckerle/superexp/pull/1>, or unsubscribe 
<https://github.com/notifications/unsubscribe-auth/AALUDAZF7ZM5XWD72HZ4Z5DYQFX2NAVCNFSM6AAAAABCJM6FHOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGA4TSMBYGEYDAOI>
.
You are receiving this because you are subscribed to this thread.Message
ID: <mbeckerle/superexp/pull/[email protected]>

RE: Theoretical Interest item - Pull Request with example schema that illustrates pathological (super-exponential) DFDL behaviors

Reply via email to