This example of the super-exponential explosion has been moved to: https://github.com/OpenDFDL/examples/tree/master/superexp
On Wed, Jan 24, 2024 at 6:00 PM Interrante, John A (GE Aerospace, US) < john.interra...@ge.com> wrote: > Yes, the subject of DFDL infoset explosion interests me, and I took a look > at the personal github PR. I agree with the README's premise that the > infoset's size should be linearly related to the input data's size by > O(sn). A runtime check for forward progress on the problematic DFDL 1.0 > features seems worthwhile implementing. A schema compile time check would > be even better if a way is found to formally prove the theoretical O(sn) > constraint. > > -----Original Message----- > From: Mike Beckerle <mbecke...@apache.org> > Sent: Wednesday, January 24, 2024 1:55 PM > To: dev@daffodil.apache.org > Subject: EXT: Theoretical Interest item - Pull Request with example schema > that illustrates pathological (super-exponential) DFDL behaviors > > I have been concerned for a while that combinations of the features in DFDL > v1.0 can be used in ways that are just far too powerful for what is needed > for data format expression. > > For example, if the input data to parse is n bits, how large should the > corresponding infoset be when the data is parsed using a DFDL schema? > Clearly at least O(n) since there are n bits and each bit could create an > element in the infoset. > > But should it be able to be O(n^2), or even O(2^n) ? > > I can think of no use case for this. > > Intuition about data formats suggests that the infoset should always be > linearly related in size to the size of the input data, and that the > constant factor on that should be relatively small. > Furthermore, the time taken to parse data should be linearly related to > the size of the data. I think lots of people are thinking DFDL parsing > should be implementable in a circuit, and the memory requirements are > proportional to the data size. > It doesn't make sense to try to make hardware implementations of DFDL if > that's not going to be the case. > > Turns out, using DFDL v1.0 using dfdl:occursCountKind='expression' and > dfdl:inputValueCalc, can create an infoset from parsing that is > *super-exponentially larger than the input data*. Yes, it is finite, and > bounded to some function of the size of the input data. But we don't even > have notations for this kind of numeric growth. > > The schema I created consumes a 3 bit integer as the data. The size of the > infoset grows like this: > > - input 000 -> infoset size 1 > - input 001 -> infoset size 1 > - input 010 -> infoset size 256 > - input 011 -> infoset size is over 10,000 elements > - ... > - input 111 -> infoset size is larger than the number of atoms in the > universe. > > There are TDML tests for this schema that illustrate this (except that last > one) > > But it is always finite! No recursion is required to do this. With > infinite space and time available it WILL terminate. I believe the language > remains theoretically decidable. > > The README in this schema project discusses this and other pathological > cases, and suggests things we can do to restrict DFDL so that this kind of > pathological behavior is not allowed. > > Really we want the largest infoset that can be created from > > - n bits of data > - a schema of size s > > to be O(sn) in size. I have started an argument in the README about why > O(sn) is the right target. Basically there are practical schemas that can > meet O(sn) as a lower bound. > > I believe this can be achieved by adding more forward-progress > requirements to DFDL - all arrays should have to make forward progress when > parsing, not just unbounded arrays, and reused groups/types should have to > make forward progress at points of reuse. By doing that, we ensure that to > make a giant infoset, you have to be consuming a giant number of bits. > > Right now this stuff is just living in a personal github repo, but I'll > give it an official better home either in DFDLSchemas or in daffodil-extra > or someplace later. > > If this is of interest, please take a look at this github pull request > below. > > ---------- Forwarded message --------- > From: Mike Beckerle <notificati...@github.com> > Date: Wed, Jan 24, 2024 at 3:50 PM > Subject: [mbeckerle/superexp] Example schema that illustrates pathological > DFDL behaviors (PR #1) > To: mbeckerle/superexp <super...@noreply.github.com> > Cc: Mike Beckerle <mbecke...@apache.org>, Your activity < > your_activ...@noreply.github.com> > > > Illustrates DFDL issues where parsing small data can lead to explosively > larger parse output. Super exponentially larger. > > The expressive power of DFDL should be restricted so that this is not > possible. > > See the README.md for discussion of the issue. > ------------------------------ > You can view, comment on, or merge this pull request online at: > > https://github.com/mbeckerle/superexp/pull/1 > Commit Summary > > - 46dbae6 > < > https://github.com/mbeckerle/superexp/pull/1/commits/46dbae6b2dc5acd76d6954109f29cb147d0e68e5 > > > First version of this example schema > > File Changes > > (9 files <https://github.com/mbeckerle/superexp/pull/1/files>) > > - *A* .gitattributes > < > https://github.com/mbeckerle/superexp/pull/1/files#diff-618cd5b83d62060ba3d027e314a21ceaf75d36067ff820db126642944145393e > > > (5) > - *M* .gitignore > < > https://github.com/mbeckerle/superexp/pull/1/files#diff-bc37d034bad564583790a46f19d807abfe519c5671395fd494d8cce506c42947 > > > (7) > - *M* README.md > < > https://github.com/mbeckerle/superexp/pull/1/files#diff-b335630551682c19a781afebcf4d07bf978fb1f8ac04c6bf87428ed5106870f5 > > > (96) > - *A* build.sbt > < > https://github.com/mbeckerle/superexp/pull/1/files#diff-5634c415cd8c8504fdb973a3ed092300b43c4b8fc1e184f7249eb29a55511f91 > > > (30) > - *A* project/build.properties > < > https://github.com/mbeckerle/superexp/pull/1/files#diff-cbc85b9df7818a480f215028e98821bd1c83adb304a9f03658595c44dddff754 > > > (1) > - *A* src/exp.dfdl.xsd > < > https://github.com/mbeckerle/superexp/pull/1/files#diff-9751c75e50c64f92bee47a05b0800aa9c5691bcd5c2b1166a1b7dcfa6e5ebc65 > > > (84) > - *A* src/superexp.dfdl.xsd > < > https://github.com/mbeckerle/superexp/pull/1/files#diff-2c057f270cab45f3f0577abc17a3472f2b86cc934f7bb72f58a911d344613572 > > > (72) > - *A* test/TestSuperexp.scala > < > https://github.com/mbeckerle/superexp/pull/1/files#diff-c6b466d6536f0f670f97aeca9a516d93581c3c8327aa6b9f9fec6657a5e6dcf4 > > > (35) > - *A* test/TestSuperexp.tdml > < > https://github.com/mbeckerle/superexp/pull/1/files#diff-f1b4e40b448ec3ab530fab92fdfebb51a705aa9fc69ea118a565ef046b0db8e0 > > > (117) > > Patch Links: > > - https://github.com/mbeckerle/superexp/pull/1.patch > - https://github.com/mbeckerle/superexp/pull/1.diff > > — > Reply to this email directly, view it on GitHub < > https://github.com/mbeckerle/superexp/pull/1>, or unsubscribe < > https://github.com/notifications/unsubscribe-auth/AALUDAZF7ZM5XWD72HZ4Z5DYQFX2NAVCNFSM6AAAAABCJM6FHOVHI2DSMVQWIX3LMV43ASLTON2WKOZSGA4TSMBYGEYDAOI > > > . > You are receiving this because you are subscribed to this thread.Message > ID: <mbeckerle/superexp/pull/1...@github.com> >