Re: The future of the daffodil DFDL schema debugger?

2021-01-06 Thread Sloane, Brandon
I agree with Steve that the "ideal" debugger would involve a rich multi-pane
GUI, and I like the idea of establishing a well-defined protocol to isolate the
GUI from the main codebase. Using a protocol would also give us scriptability
for nearly free, as users could leverage normal shell tools to script whatever
debug automation/conveniences they need.

I'm not sure how well existing debugger protocols would work though (although 
if they do fit, it would save us a lot of effort). The type of debugging needed 
for Daffodil schema strikes me as fairly distinct from what you would typically 
expect from most debuggers.

On the subject of functionality, one feature that I would really like to see 
added is time travel. With the work we already do to support backtracking, it
should be relatively simple to add support for fully restoring the parse state
to a prior saved state, which would be a massive QoL improvement for the
interactive debugger.
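
To make the time-travel idea concrete, here is a minimal Python sketch of
stepping back through saved states. Everything in it (ParseState,
TimeTravelLog, the field names) is hypothetical and is not Daffodil's API. It
only assumes that the parse state can be copied at each step, much as
backtracking already requires.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class ParseState:
    """Hypothetical stand-in for a parser's mutable state."""
    bit_pos: int = 0
    infoset: list = field(default_factory=list)


class TimeTravelLog:
    """Keeps one snapshot per step so a debugger can jump back to any of them."""

    def __init__(self):
        self._snapshots = []

    def record(self, state):
        # Deep copy so later mutation of the live state cannot alter history.
        self._snapshots.append(copy.deepcopy(state))
        return len(self._snapshots) - 1  # step number of this snapshot

    def restore(self, step):
        # Drop the history after the restored step, then hand back a fresh
        # copy so the stored snapshot itself stays reusable.
        del self._snapshots[step + 1:]
        return copy.deepcopy(self._snapshots[step])


log = TimeTravelLog()
state = ParseState()
for name, width in [("magic", 16), ("version", 8), ("length", 32)]:
    state.infoset.append(name)
    state.bit_pos += width
    log.record(state)

state = log.restore(0)  # step back to just after "magic" was parsed
print(state.bit_pos, state.infoset)
```

A real implementation would presumably reuse whatever marks backtracking
already takes instead of deep-copying the whole state at every step.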

For the non-interactive tracer (and, to some extent, the interactive debugger),
I think we may need to support varying levels of verbosity. In addition to a
global verbosity level, we should also have some way to flag specific "things"
to get more or less detail. Specifying exactly what a "thing" is is its own
discussion, as even a simple type in the schema can end up having many
different regions (prefix, suffix, padding, etc).
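
One way to picture a global level plus per-"thing" overrides is the Python
sketch below. The Verbosity class and the "element/region" key format are
assumptions made for illustration only; what counts as a "thing" is left open
above, so the sketch treats a target as an opaque string.

```python
class Verbosity:
    """Global verbosity level plus per-target overrides."""

    def __init__(self, default_level=1):
        self.default_level = default_level
        self._overrides = {}

    def set_level(self, target, level):
        # A target is any string key; "element/region" narrows to one region.
        self._overrides[target] = level

    def level_for(self, element, region=None):
        # Most specific setting wins: region, then element, then global.
        if region is not None:
            key = f"{element}/{region}"
            if key in self._overrides:
                return self._overrides[key]
        return self._overrides.get(element, self.default_level)

    def should_trace(self, event_level, element, region=None):
        # An event is shown when its level does not exceed the allowed level.
        return event_level <= self.level_for(element, region)


v = Verbosity(default_level=1)
v.set_level("header", 0)            # silence the whole element
v.set_level("payload/padding", 3)   # extra detail for one region only
print(v.should_trace(2, "payload", "padding"), v.should_trace(2, "payload"))
```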

At a high level, I think I see 2 ways forward:

1) Mike's suggestion: make incremental improvements to our existing tooling,
focusing primarily on reducing the volume of information the user is exposed to.

2) Steve's idea: establish a debugging protocol and develop an external 
debugger.

I would add to (2) that we can develop an experimental debugger to play around
with different design ideas much more easily than we could if the debugger were
itself part of Daffodil proper. Since I don't think we have a solid idea of
what this debugger looks like, I think this is valuable.

Additionally, even if we use our own non-standard protocol, implementing (2) 
would still make it far easier for someone to integrate Daffodil debugging 
facilities into third party applications.

In my mind, this is entirely a question of engineering effort. If we are trying
to improve the debugger, (1) is a must-have, at least in the sense of improving
the output of --trace, as that non-interactive interface is simple enough to be
quickly usable in almost any configuration. Having said that, if we are going
to do (2), we should do it first, as it would probably simplify the work needed
for (1).

If we have the resources, (2) would result in a far superior product.

Additionally, I think the work needed for (2) could have benefits beyond simply
debugging. For a while I have wanted a tool similar to Wireshark's dissectors,
where we could provide a schema and then see the binary data and infoset
side-by-side and see how regions of the two map to each other.

From: Steve Lawrence 
Sent: Wednesday, January 6, 2021 1:42 PM
To: dev@daffodil.apache.org 
Subject: Re: The future of the daffodil DFDL schema debugger?

Re: The future of the daffodil DFDL schema debugger?

2021-01-06 Thread Steve Lawrence
Yep, something like that seems very reasonable for dealing with large
infosets. But it feels like we still run into usability issues.
For example, what if a user wants to see more? We need some
configuration options to increase what we've elided. It's not big, but
every new thing that needs configuration adds complexity and decreases
usability.

And I think the only reason we are trying to spend effort eliding
things is because we're limited to this gdb-like interface where you can
only print out a little information at a time.

I think what would really help is to dump this gdb interface and instead use
multiple windows/views. As a really close example to what I imagine, I
recently came across this hex editor:

https://www.synalysis.net/

The screenshots are a bit small so it's not super clear, but this tool
has one view for the data in hex, and one view for a tree of parsed
results (which is very similar to our infoset). The "infoset" view has
information like offset/length/value, and can be related back to the
data view to find the actual bits.
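
The link between the two views can be sketched in a few lines of Python. This
assumes each infoset node records the byte offset and length it was parsed
from; the Node class, its field names, and the GIF-like sample data are
hypothetical, not anything Daffodil or that hex editor provides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical infoset node carrying the data region it was parsed from."""
    name: str
    offset: int          # byte offset into the data
    length: int          # length in bytes
    value: object = None
    children: list = field(default_factory=list)


def find_node(node, byte_pos):
    """Return the deepest node whose region covers byte_pos, or None."""
    if not (node.offset <= byte_pos < node.offset + node.length):
        return None
    for child in node.children:
        hit = find_node(child, byte_pos)
        if hit is not None:
            return hit
    return node


def hex_region(data, node):
    """Hex text of the bytes a node covers, for highlighting in a data view."""
    return data[node.offset:node.offset + node.length].hex(" ")


data = bytes.fromhex("47494638396101000200")
root = Node("GIF", 0, 10, children=[
    Node("Signature", 0, 3, "GIF"),
    Node("Version", 3, 3, "89a"),
    Node("Width", 6, 2, 1),
    Node("Height", 8, 2, 2),
])

print(find_node(root, 7).name)            # node under a hex-view cursor
print(hex_region(data, root.children[1])) # bytes behind a selected tree node
```

Going from a cursor position to a node and from a node to its bytes covers
both directions of the "relate back to the data view" behavior described
above. Bit-level offsets would need a finer unit than bytes.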

I imagine the "next generation daffodil debugger" to look much like
this. As data is parsed, the infoset view fills up. This view could act
like a standard GUI tree so you could collapse sections or scroll around
to show just the parts you care about, and have search capabilities to
quickly jump around. The advantage here is you no longer really need
automated eliding or heuristics for what the user *might* care about.
You just show the whole thing and let the user scroll around. As daffodil
parses and backtracks, this tree grows or shrinks.

I also imagine you could have a cursor moving around the hex view, so as
daffodil moves around (e.g. scanning for delimiters, extracting
integers), one could update this data view to show what daffodil is
doing and where it is.

I also imagine there could be other views as well. For example, a schema
view to show where in the schema daffodil is, and to add/remove
breakpoints. And an information view for things like variables, in-scope
delimiters, PoU's, etc.

The only reason I mention a debug protocol is that it would allow this GUI
to be more easily written in something other than Java/Scala to take
advantage of other GUI toolkits. It's been a long while since I've done
anything with Java GUIs, but they seemed pretty poor the last time I looked
at them. It would even allow for a TUI, which Java has little/no support
for. It also enables things like remote debugging if a socket IPC was
used. Though I'm not sure all of that is necessary. Just thinking what
would be ideal, and it can always be pared back.
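
As a rough illustration of what a language-neutral protocol could look like,
the Python sketch below uses newline-delimited JSON events. The event names
and fields (startElement, backtrack, bitPos, and so on) are invented for the
example and are not an existing Daffodil or debugger protocol.

```python
import json


def encode_event(kind, **fields):
    """Serialize one debugger event as a single JSON line."""
    return json.dumps({"event": kind, **fields}, sort_keys=True) + "\n"


def decode_events(stream_text):
    """Parse newline-delimited JSON back into a list of event dicts."""
    return [json.loads(line) for line in stream_text.splitlines() if line.strip()]


# What a debugged process might write to a socket or pipe.
wire = (
    encode_event("startElement", name="Width", bitPos=48)
    + encode_event("backtrack", toBitPos=24)
    + encode_event("endElement", name="Width", bitPos=64)
)

# A front end in any language (or a shell tool such as jq) could consume this.
events = decode_events(wire)
backtracks = [e for e in events if e["event"] == "backtrack"]
print(backtracks)
```

One event per line keeps the stream easy to consume from a GUI toolkit in
another language, over a socket for remote debugging, or from ordinary shell
tools.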


On 1/6/21 12:44 PM, Beckerle, Mike wrote:

Re: The future of the daffodil DFDL schema debugger?

2021-01-06 Thread Beckerle, Mike
I don't think of it as a daffodil debug protocol, but just a separation of 
concerns between display of information and the behaviors of parse/unparse that 
need to be points where users can pause, and data structures available to 
display.

E.g., it is 100% a display issue that the infoset (shown as XML) is clumsy, too 
big, etc.  The infoset is available in the processor state, and one can examine 
the current node, enclosing node, prior sibling(s), following sibling(s), etc. 
One can elide contents that are too big for hexBinary, etc.

I think this problem, how to display the infoset with sensible limits on 
sizing, is fairly easy to come up with some design for, that will at least be 
(1) always fairly small (2) much more useful in more cases. It won't be perfect 
but can be much better than what we do now.

One sensible display "mode" should be that displaying the context surrounding 
the current element (when parsing or unparsing) displays at most N-lines. (N/2 
before, N/2 after) with a maximum length of L characters (settable within 
reason ?)

Sibling and enclosing nodes would be displayed eliding their contents to at 
most 1 line.

Here's an example of what I mean. Displaying up to M=10 lines total:

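...
<enclosingParent1>
   ...
   <priorSibling2>89ab782 ...</>
   <priorSibling1>some text is here and some more text</>
   <currentNode>value might be some big thing which needs to be elided ...</>
   <followingSibling1>...</>
   ???
</>
???

The </> is just an idea to reduce XML matching end-tag clutter.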

The ... on a line alone or where element content would appear generally means 1 
or more other siblings. The way the display above starts with ... means that 
this is a relative inner nest, not starting from the absolute root.

The ... within simple content means that content is elided to fit on one line. 
Always follows some text characters to differentiate from the child-element 
context.

The ??? means zero or more other siblings.

I used bold italic above to point out that the current node would be 
highlighted somehow. Probably a way to do this that doesn't require display 
modes would be useful. E.g., a text marker like ">>>" as in:

>>> <currentNode>value</>

might be better, particularly for a trace output being dumped to a text file.

I made the above example an unparser kind of example by showing a following 
sibling that exists that is after the current node.

I think the key concept is that any sibling node is displayed in a way that 
fits on one line.
E.g., even if the element name was really long, I'd suggest:

  abcd ... 

Where the element name itself gets elided because it is too long.

A thought. Note that the above presentation is shown as quasi-XML, but there's 
nothing XML-specific about it. A JSON-friendly equivalent could be done as well:

enclosingParent1 = {
   ...
   priorSibling2 = "89ab782..."
   priorSibling1 = "some text is here and some more text"
   currentNode = "value might be some big thing which needs to be elided ..."
   followingSibling1 = { ... }
   ???
}
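
The display rules above (every sibling on one line, a cap on total lines, a
cap on line length, a ">>>" marker on the current node) can be sketched in
Python as follows. The function names and the default limits are assumptions
for illustration. The sketch prints "..." only when siblings were actually
left out, and does not model the "???" marker.

```python
def elide(text, max_len):
    """Shorten text to at most max_len characters (assumes max_len >= 4)."""
    if len(text) <= max_len:
        return text
    return text[:max_len - 4] + " ..."


def render_context(parent, siblings, current, max_lines=10, max_len=60):
    """Render siblings around index `current`, one line each, within max_lines.

    siblings is a list of (name, value) pairs with string values.
    """
    # Reserve four lines: parent open, parent close, and two "..." lines.
    budget = max(max_lines - 4, 1)
    before = (budget - 1) // 2
    first = max(current - before, 0)
    last = min(first + budget, len(siblings))
    lines = [f"{parent} = {{"]
    if first > 0:
        lines.append("   ...")          # siblings left out before the window
    for i in range(first, last):
        name, value = siblings[i]
        marker = ">>>" if i == current else "   "
        lines.append(elide(f'{marker}{name} = "{value}"', max_len))
    if last < len(siblings):
        lines.append("   ...")          # siblings left out after the window
    lines.append("}")
    return "\n".join(lines)


siblings = [
    ("priorSibling2", "89ab782" * 20),
    ("priorSibling1", "some text is here and some more text"),
    ("currentNode", "value might be some big thing " * 5),
    ("followingSibling1", "more data"),
]
out = render_context("enclosingParent1", siblings, current=2)
print(out)
```

Because the marker replaces the indent instead of relying on bold or color,
the same output works for a trace dumped to a text file.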

That's enough for 1 email thread on this debug topic.



From: Steve Lawrence 
Sent: Tuesday, January 5, 2021 2:26 PM
To: dev@daffodil.apache.org 
Subject: The future of the daffodil DFDL schema debugger?


Now that we're in a new year, I'd like to start a discussion about the
Daffodil DFDL Schema debugger and how it might be improved to be more
useful.

Note that this is not the capability to debug Daffodil itself in
something like Eclipse/IntelliJ, but the ability for Daffodil to provide
enough extra information during a parse/unparse so that a schema
developer can get an idea of what Daffodil is doing. This makes it
easier for users (rather than developers) to determine why a schema
isn't giving the expected parse/unparse result (either because of bad data
or a faulty schema).

The debugger is currently enabled by providing the --debug or
--trace flags in the CLI. More information about that is here:

https://daffodil.apache.org/debugger/

This enables a TUI and commands somewhat similar to GDB, providing things
like breakpoints, steps, displaying the current infoset, displaying a dump
of the data, etc.

Although I find this tool pretty useful, it definitely has some glaring
issues.

The most glaring to me is that it really isn't useful at all for
debugging unparse. The data dumps only include the main output stream,
so determining things like suspensions and buffered output is impossible.

Another issue is the infoset output. When outputting the infoset, the
debugger currently just walks the entire thing, converts it to XML,
and displays the XML. For large infosets, this is excessive and can make it
impossible to use, even with the configuration options that limit how much of
that infoset is actually printed to the screen. Also, things like large
hex binary blobs create excessive and unusable output.

Another thing I feel is missing is a schema view. Right now it's very
difficult to know where in the schema Daffodil actually is.

I think these issues just need some thought 

Re: Embedded Schematron progress

2021-01-06 Thread John Wass
The schema and tests for BMP/GIF/JPEG were moved into branches on those
DFDLSchemas repos.  After this PR is merged and the next release is
published, those tests could be added to each of those repos.  I suppose the
embedded schematron schema could be merged any time without the tests.  Those
repos would be a good context to continue and resolve the "best practices
in the schematron" discussions.

On Tue, Dec 22, 2020 at 9:53 AM John Wass  wrote:

> > The second one is similar to examples in the GIF schema
> .
> That schema can be added in the PR unit tests, to go along with the BMP and
> JPEG.
>
> Added the gif schema to the tests, looking good.  Specifically looked at
> rule `count(/GIF/Global_Color_Table/RGB) eq math:pow(2,
> ../number(Size_of_Global_Color_Table) + 1)`.
>
> Working on embedding the bmp schema now as the final integration test.
>
>
> On Mon, Dec 21, 2020 at 7:49 AM John Wass  wrote:
>
>> > Does the process create SVRL files when it completes?
>>
>> No, the svrl is consumed and converted into Daffodil diagnostics.
>>
>>
>> >  Is there a commandline option to direct the SVRL file to a specific
>> path and name?
>>
>> It doesn't, but is a good idea and certainly could.  Passing a flag
>> through the validator config could trigger writing the file.
>>
>> Probably be in a follow up PR.
>>
>>
>> > I'm curious of those type of tests will work with this process.
>>
>> They should.  The first can be checked in a unit test that matches a
>> byte.  The second one is similar to examples in the GIF schema
>> .
>> That schema can be added in the PR unit tests, to go along with the BMP and
>> JPEG.
>>
>>
>>
>>
>> On Fri, Dec 18, 2020 at 2:43 PM Rege Nteligen 
>> wrote:
>>
>>> I took a look at the sample xsd's with the embedded schematron asserts.
>>> It looks good.  Does the process create SVRL files when it completes?  Is
>>> there a commandline option to direct the SVRL file to a specific path and
>>> name?
>>>
>>> I was recently working with a modified daffodil GIF schema and
>>> schematron to report various findings with GIF files.  Several tests
>>> involved testing that keywords were not in HEX blob fields.  I'm curious if
>>> those types of tests will work with this process.  This is a sample assert:
>>>  
>>> GIF: FAIL: LSD_Blob: AFTER-HDR-REF-SQL: Possible
>>> malicious SQL reference between segemnts
>>> 
>>>
>>> I've also done tests to see if the count of bytes in one field matched
>>> the size of the field value from another field:
>>> 
>>> GIF: RED: LSG_GCL: GCL-RGB-CNT: There must be
>>> Size_of_Global_Color_Table RGB values.
>>> 
>>>
>>>
>>>
>>> On 2020/12/18 17:21:02, John Wass  wrote:
>>> > The Embedded Schematron PR is moving along, hoping to get it out of WIP
>>> > soon.  https://github.com/apache/incubator-daffodil/pull/463
>>> >
>>> > The JPEG and BMP schema repos are being used for testing now, and the PNG
>>> > looks like it would provide some great coverage.. maybe too great :/  Any
>>> > other noteworthy sources of sch+data that might be beneficial to test with?
>>> >
>>> > Observations on embedding
>>> > - Behavior has been predictable, and errors have been clear
>>> > - There are multiple placement options for schematron rules in a schema
>>> > - The Validator API has held up well, but there might be one issue to
>>> >   come out of this effort
>>> >
>>> > Examples at
>>> > https://github.com/jw3/incubator-daffodil/tree/validator_spi/embedded_schematron/daffodil-schematron/src/test/resources/xsd
>>> >
>>>
>>