I have been concerned for a while that combinations of the features in DFDL
v1.0 can be used in ways that are just far too powerful for what is needed
for data format expression.

For example, if the input data to parse is n bits, how large should the
corresponding infoset be when the data is parsed using a DFDL schema?
Clearly at least O(n) since there are n bits and each bit could create an
element in the infoset.

But should it be able to be O(n^2), or even O(2^n) ?

I can think of no use case for this.

Intuition about data formats suggests that the infoset should always be
linearly related in size to the size of the input data, and that the
constant factor on that should be relatively small.
Furthermore, the time taken to parse data should be linearly related to the
size of the data. I think lots of people are thinking DFDL parsing
should be implementable in a circuit, and the memory requirements are
proportional to the data size.
It doesn't make sense to try to make hardware implementations of DFDL if
that's not going to be the case.

Turns out, using DFDL v1.0 using dfdl:occursCountKind='expression' and
dfdl:inputValueCalc, can create an infoset from parsing that is
*super-exponentially
larger than the input data*. Yes, it is finite, and bounded to some
function of the size of the input data. But we don't even have notations
for this kind of numeric growth.

The schema I created consumes a 3 bit integer as the data. The size of the
infoset grows like this:

   - input 000 -> infoset size 1
   - input 001 -> infoset size 1
   - input 010 -> infoset size 256
   - input 011 -> infoset size is over 10,000 elements
   - ...
   - input 111 -> infoset size is larger than the number of atoms in the
   universe.

There are TDML tests for this schema that illustrate this (except that last
one)

But it is always finite! No recursion is required to do this. With infinite
space and time available it WILL terminate. I believe the language remains
theoretically decidable.

The README in this schema project discusses this and other pathological
cases, and suggests things we can do to restrict DFDL so that this kind of
pathological behavior is not allowed.

Really we want the largest infoset that can be created from

   - n bits of data
   - a schema of size s

to be O(sn) in size. I have started an argument in the README about why
O(sn) is the right target. Basically there are practical schemas that can
meet O(sn) as a lower bound.

I believe this can be achieved by adding more forward-progress requirements
to DFDL - all arrays should have to make forward progress when parsing, not
just unbounded arrays, and reused groups/types should have to make forward
progress at points of reuse. By doing that, we ensure that to make a giant
infoset, you have to be consuming a giant number of bits.

Right now this stuff is just living in a personal github repo, but I'll
give it an official better home either in DFDLSchemas or in daffodil-extra
or someplace later.

If this is of interest, please take a look at this github pull request
below.

---------- Forwarded message ---------
From: Mike Beckerle <notificati...@github.com>
Date: Wed, Jan 24, 2024 at 3:50 PM
Subject: [mbeckerle/superexp] Example schema that illustrates pathological
DFDL behaviors (PR #1)
To: mbeckerle/superexp <super...@noreply.github.com>
Cc: Mike Beckerle <mbecke...@apache.org>, Your activity <
your_activ...@noreply.github.com>


Illustrates DFDL issues where parsing small data can lead to explosively
larger parse output. Super exponentially larger.

The expressive power of DFDL should be restricted so that this is not
possible.

See the README.md for discussion of the issue.
------------------------------
You can view, comment on, or merge this pull request online at:

  https://github.com/mbeckerle/superexp/pull/1
Commit Summary

   - 46dbae6
   
<https://github.com/mbeckerle/superexp/pull/1/commits/46dbae6b2dc5acd76d6954109f29cb147d0e68e5>
   First version of this example schema

File Changes

(9 files <https://github.com/mbeckerle/superexp/pull/1/files>)

   - *A* .gitattributes
   
<https://github.com/mbeckerle/superexp/pull/1/files#diff-618cd5b83d62060ba3d027e314a21ceaf75d36067ff820db126642944145393e>
   (5)
   - *M* .gitignore
   
<https://github.com/mbeckerle/superexp/pull/1/files#diff-bc37d034bad564583790a46f19d807abfe519c5671395fd494d8cce506c42947>
   (7)
   - *M* README.md
   
<https://github.com/mbeckerle/superexp/pull/1/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5>
   (96)
   - *A* build.sbt
   
<https://github.com/mbeckerle/superexp/pull/1/files#diff-5634c415cd8c8504fdb973a3ed092300b43c4b8fc1e184f7249eb29a55511f91>
   (30)
   - *A* project/build.properties
   
<https://github.com/mbeckerle/superexp/pull/1/files#diff-cbc85b9df7818a480f215028e98821bd1c83adb304a9f03658595c44dddff754>
   (1)
   - *A* src/exp.dfdl.xsd
   
<https://github.com/mbeckerle/superexp/pull/1/files#diff-9751c75e50c64f92bee47a05b0800aa9c5691bcd5c2b1166a1b7dcfa6e5ebc65>
   (84)
   - *A* src/superexp.dfdl.xsd
   
<https://github.com/mbeckerle/superexp/pull/1/files#diff-2c057f270cab45f3f0577abc17a3472f2b86cc934f7bb72f58a911d344613572>
   (72)
   - *A* test/TestSuperexp.scala
   
<https://github.com/mbeckerle/superexp/pull/1/files#diff-c6b466d6536f0f670f97aeca9a516d93581c3c8327aa6b9f9fec6657a5e6dcf4>
   (35)
   - *A* test/TestSuperexp.tdml
   
<https://github.com/mbeckerle/superexp/pull/1/files#diff-f1b4e40b448ec3ab530fab92fdfebb51a705aa9fc69ea118a565ef046b0db8e0>
   (117)

Patch Links:

   - https://github.com/mbeckerle/superexp/pull/1.patch
   - https://github.com/mbeckerle/superexp/pull/1.diff

—
Reply to this email directly, view it on GitHub
<https://github.com/mbeckerle/superexp/pull/1>, or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AALUDAZF7ZM5XWD72HZ4Z5DYQFX2NAVCNFSM6AAAAABCJM6FHOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGA4TSMBYGEYDAOI>
.
You are receiving this because you are subscribed to this thread.Message
ID: <mbeckerle/superexp/pull/1...@github.com>

Reply via email to