Re: The future of the daffodil DFDL schema debugger?

2021-05-24 Thread Adam Rosien
Your message is extremely helpful! I'll spend some time working through it
and follow up.


Re: The future of the daffodil DFDL schema debugger?

2021-05-24 Thread Beckerle, Mike
Some thoughts re: data format debugger

I suggest we enumerate

  *   every single piece of state of the parser,
  *   every single piece of state of the unparser,
  *   each action/step of the parser (every parse combinator or primitive, 
and their subactions),
  *   and of the unparser (every unparse combinator, primitive, suspension, ...)

and wire-frame/mock-up some display for each piece of state, and how, if 
changed by a step, the change to that piece of state would be displayed.

We can write down the nuances associated with these data items/actions that 
impact debugger display.

Some of these states/actions will be analogous to things in conventional 
debuggers. (e.g., looking at the values of variables) Others will be specific 
to DFDL needs. (e.g., looking at layers in the data stream, visualizing 
delimiter scanning success/failure, backtracking)

Core concepts a debugger needs are framing vs. content vs. value, and the 
"regions" in the data stream that make these up. The framing includes 
initiators, terminators, separators, alignment regions, prefix-length regions, 
leading/trailing skip regions, unused regions. Those surround the content 
region, and when padding/filling is involved (for simple types that are 
textual) the content region contains leading pad and trailing pad regions, 
surrounding the value region.
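
To make the nesting concrete, here is a minimal sketch in Python. It is not
Daffodil code: the Region class, its field names, and the example element are
assumptions made up for illustration. It models framing, content, pad, and
value regions as nested bit ranges that a debugger could render as nested
boxes.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Region:
    """A span of the data stream measured in bits (hypothetical model)."""
    kind: str            # e.g. "initiator", "content", "leadingPad", "value"
    start_bit: int
    length_bits: int
    children: List["Region"] = field(default_factory=list)

    @property
    def end_bit(self) -> int:
        return self.start_bit + self.length_bits


def render(region: Region, depth: int = 0) -> List[str]:
    """Return one indented line per region, outermost region first."""
    lines = [f"{'  ' * depth}{region.kind} [{region.start_bit}, {region.end_bit})"]
    for child in region.children:
        lines.extend(render(child, depth + 1))
    return lines


# Assumed example: the text "[  42]" in 8-bit characters, where "[" is the
# initiator, "]" the terminator, and the two spaces are leading pad.
element = Region("element", 0, 48, [
    Region("initiator", 0, 8),
    Region("content", 8, 32, [
        Region("leadingPad", 8, 16),
        Region("value", 24, 16),
    ]),
    Region("terminator", 40, 8),
])

for line in render(element):
    print(line)
```

A real display would also need the alignment, skip, prefix-length, and unused
regions listed above; they fit the same shape.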

An example of graphical nested box representation of these regions is here in a 
design note about Daffodil:

https://daffodil.apache.org/dev/design-notes/term-sharing-in-schema-compiler/
(see section "Details of Unique and Shared Regions")

The way to start this effort is to look at the UState and PState classes. These 
are the state blocks. Every piece of these is potentially important to the 
debugger.

Lastly, an important aspect of Daffodil is the streaming behavior of the parser 
and unparser. While I believe it is more important to get something working 
than for it to cover every feature, this is an area where not anticipating how 
it needs to work is likely to lock one out of a future scenario that 
accommodates it.

So the parser doesn't produce an infoset. It produces a stream of infoset 
events, or callbacks to be exact.
Due to backtracking in the parser, these events can be held up for a 
substantial time while the parser continues. So we can't assume that there is 
any correlation between parser activity and the production of events.
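
One way to picture why events lag behind the parser is the sketch below, in
Python. It is an assumption-laden illustration, not Daffodil's actual
mechanism: BufferedEventSink and its mark/backtrack/resolve methods are
hypothetical names. Events emitted while a point of uncertainty is open are
held back, discarded on backtrack, and delivered only once no open point
remains.

```python
from typing import Callable, List


class BufferedEventSink:
    """Holds infoset events while any point of uncertainty is unresolved."""

    def __init__(self, deliver: Callable[[str], None]) -> None:
        self._deliver = deliver
        self._pending: List[str] = []
        self._marks: List[int] = []  # index into _pending for each open point

    def emit(self, event: str) -> None:
        if self._marks:
            self._pending.append(event)   # might still be backtracked away
        else:
            self._deliver(event)          # nothing can undo this event

    def mark(self) -> None:
        """Open a point of uncertainty."""
        self._marks.append(len(self._pending))

    def backtrack(self) -> None:
        """Abandon the innermost point and every event emitted since it."""
        start = self._marks.pop()
        del self._pending[start:]

    def resolve(self) -> None:
        """Commit the innermost point; flush once no points remain open."""
        self._marks.pop()
        if not self._marks:
            for event in self._pending:
                self._deliver(event)
            self._pending.clear()


delivered: List[str] = []
sink = BufferedEventSink(delivered.append)
sink.emit("startElement a")
sink.mark()
sink.emit("startElement b")   # held back: the parser is still speculating
sink.backtrack()              # branch failed, so "b" is never delivered
sink.mark()
sink.emit("startElement c")
sink.resolve()                # branch succeeded, so "c" is delivered now
print(delivered)
```

A debugger stepping through the parse would therefore want to show both the
delivered events and the pending ones, since only the latter can still
disappear.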

The unparser doesn't consume an infoset. It consumes a stream of infoset 
events. Specifically, the unparser is the callback-handler for unparse infoset 
events.

The infoset gets trimmed so that we needn't build up the complete infoset tree 
in memory. As parse-events are produced, no-longer necessary parts of the 
infoset are pruned away. Similarly, when unparsing, once a part of the infoset 
has been unparsed, that part of the infoset tree is pruned away if no longer 
needed.



From: Steve Lawrence 
Sent: Thursday, April 22, 2021 9:32 AM
To: dev@daffodil.apache.org 
Subject: Re: The future of the daffodil DFDL schema debugger?

Some thoughts related to showing the infoset as if it were a variable as
this is prototyped

1) How do DAP/IDEs represent very large hierarchical data? Infosets can
be huge, and most of the time a user only cares about the most recent
infoset item. So some way to follow and show just the most recent part of
the infoset is important. The current Daffodil debugger has an
"infosetLines" setting so that it only shows the most recent X lines,
which is about all a user cares about when stepping through a parse.
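
As a rough illustration of that "most recent X lines" behaviour, a sketch in
Python follows. The function name is hypothetical, and treating a
non-positive limit as "no limit" is an assumption here, not a statement about
the existing debugger.

```python
from typing import List


def tail_of_infoset(xml_text: str, max_lines: int) -> List[str]:
    """Return at most the last max_lines lines of a rendered infoset."""
    lines = xml_text.splitlines()
    if max_lines <= 0:          # assumption: non-positive means unlimited
        return lines
    return lines[-max_lines:]


rendered = "<record>\n  <id>1</id>\n  <name>abc</name>\n  <size>42</size>"
for line in tail_of_infoset(rendered, 2):
    print(line)
```

Note that this still renders the whole infoset before trimming it, so it
helps the display but not the cost of producing the text.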

2) Infoset items are added and removed very frequently during a parse.
Currently, when the Daffodil debugger shows the infoset it just converts
the entire thing to XML and displays that. This doesn't work at all for
large infosets since this can take a long time. I was hoping this issue
would get resolved with this new debugging infrastructure. When the
infoset is modified, we ideally want a way to specify via DAP that parts
of the variable hierarchy were added/removed rather than having to send
the entire infoset during every variable update.
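
The bookkeeping side of such a delta could look like the Python sketch below.
Everything in it is hypothetical (the class, the path strings, the tuple
format), and it says nothing about whether DAP can carry the result; it only
shows recording additions and removals as they happen and handing the
debugger the net change at each stop.

```python
from typing import List, Tuple

Change = Tuple[str, str]  # ("added" | "removed", infoset path)


class InfosetChangeLog:
    """Collects infoset changes between two debugger stops."""

    def __init__(self) -> None:
        self._changes: List[Change] = []

    def added(self, path: str) -> None:
        self._changes.append(("added", path))

    def removed(self, path: str) -> None:
        # An item added and removed between stops was never visible to the
        # user, so the two changes cancel. Only the exact path is cancelled;
        # descendants are not handled in this sketch.
        if ("added", path) in self._changes:
            self._changes.remove(("added", path))
        else:
            self._changes.append(("removed", path))

    def drain(self) -> List[Change]:
        """Return the net changes since the last call and reset the log."""
        changes, self._changes = self._changes, []
        return changes


log = InfosetChangeLog()
log.added("/record")
log.added("/record/id")
log.removed("/record/id")     # e.g. the parser backtracked over this element
print(log.drain())
```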

3) I can imagine a feature where a user would want to select an infoset
item and jump to the associated schema element, or query information
about that infoset item (e.g., what bit position did it start at, what
was the length). We don't have this right now, but it would be really
nice to have. This suggests that we need metadata associated with each of
the variables. Does DAP have a concept of that, and do IDEs have a way to
show it?
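
For discussion, the kind of per-item metadata this implies might be shaped
like the Python sketch below. The field names, the path keys, and the example
schema file are all made up for illustration; nothing here reflects an
existing Daffodil or DAP structure.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ItemMetadata:
    """What a user might query about one infoset item (hypothetical)."""
    schema_file: str
    schema_line: int
    start_bit: int
    length_bits: int


class MetadataIndex:
    """Maps an infoset path to the metadata recorded when it was parsed."""

    def __init__(self) -> None:
        self._by_path: Dict[str, ItemMetadata] = {}

    def record(self, path: str, meta: ItemMetadata) -> None:
        self._by_path[path] = meta

    def forget(self, path: str) -> None:
        # Called when the item is removed, e.g. by backtracking or pruning.
        self._by_path.pop(path, None)

    def lookup(self, path: str) -> Optional[ItemMetadata]:
        return self._by_path.get(path)


index = MetadataIndex()
index.record("/record/id", ItemMetadata("example.dfdl.xsd", 17, 8, 32))
print(index.lookup("/record/id"))
```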

On 4/21/21 7:52 PM, Adam Rosien wrote:
> I've been reading up on DAP and wanted to share...
>
>> There are many areas though that are unique to Daffodil that have no
> representation in the spec.  These things (like InputStream, Infoset, PoU,
> different variable types, backtracking, etc) will need an extension to
> DAP.  This really boils down to defining these things to fit under the DAP
>