Re: idea for helping with "left over data error"

Steve Lawrence Thu, 14 Apr 2022 12:52:17 -0700

That doesn't seem unreasonable to me, but here are some counterexamples where I think this approach won't help, maybe something to consider:

1) Imagine a simple schema that parses a single byte with no points of uncertainty, and the input data is two bytes. In this case, there will be no parse errors to show the user since everything parsed exactly as expected, yet there is still left over data. This change won't help this case at all. But this is maybe trivial and pretty unlikely.

But more generally, and maybe more commonly, the problem is using the wrong length for a field. This will make things quickly go off the rails, and will not generate a parse error related to that length. And if we show any following parse errors, they will only be misleading.
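
To make the wrong-length case concrete, here is a toy sketch in Python. It is not Daffodil code; `parse_fixed` and the sample data are made up for illustration. It parses consecutive fixed-length fields and reports any errors plus whatever data was not consumed.

```python
# Toy illustration only; this is not how Daffodil is implemented.
def parse_fixed(data: bytes, field_lengths: list[int]):
    """Parse consecutive fixed-length fields.

    Returns (fields, errors, leftover).
    """
    pos = 0
    fields, errors = [], []
    for n in field_lengths:
        if pos + n > len(data):
            errors.append(f"need {n} bytes at offset {pos}")
            break
        fields.append(data[pos:pos + n])
        pos += n
    return fields, errors, data[pos:]


# Intended format: a 2-byte field followed by a 4-byte field.
# The (hypothetical) schema author wrote 3 for the second length by mistake.
data = b"ABCDEF"
fields, errors, leftover = parse_fixed(data, [2, 3])

# Every field "parsed", no error was recorded, yet one byte is left over.
print(fields, errors, leftover)
```

With a wrong length there is simply no failure to report, so a saved "last failure" would be empty here.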


2) Say we have a schema like this:

  <element name="elem1" minOccurs="0" />
  <element name="elem2" minOccurs="0" />
  ...
  <element name="elemN" minOccurs="0" />

And say we fail to parse elem2 because our schema is broken. It's optional so we just continue on. And it's likely that everything after that is going to fail as well. No big deal, it's all optional. But this means we'll have parse errors for every elem after elem2. The one we actually care about is waaaay back at the beginning of the parse. But we don't know that is where things went off the rails. To make matters worse, imagine that elem1 was actually not in the data. So we'd get a parse error for every element, and only the one for elem2 is actually useful. There's just no way we can know that and suggest it to the user.

And like the first case, showing additional parse errors might be confusing or misleading. In this case, we'll get a slew of parse errors that's going to be overwhelming. And if we show only the few most recent errors, the user will focus all their energy looking at why elemN or elemN - 1 are failing to parse, when really the issue happened waaaay back at elem2.
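
A toy Python sketch of that cascade, under made-up assumptions: each element is a single tag byte, every element is optional, and the schema entry for elem2 is the broken one. None of these names come from Daffodil.

```python
# Toy illustration only; tags, names, and data are hypothetical.
def parse_optional_sequence(data: bytes, schema):
    """Try each optional element in order; a failure is recorded and skipped."""
    pos = 0
    parsed, errors = [], []
    for name, tag in schema:
        if data[pos:pos + 1] == tag:
            parsed.append(name)
            pos += 1
        else:
            errors.append(f"{name}: expected {tag!r} at offset {pos}")
    return parsed, errors, data[pos:]


# The data holds elem2..elem5; elem1 is legitimately absent.
data = b"2345"
# Broken schema: elem2 expects b"X" where the data really has b"2".
schema = [("elem1", b"1"), ("elem2", b"X"), ("elem3", b"3"),
          ("elem4", b"4"), ("elem5", b"5")]

parsed, errors, leftover = parse_optional_sequence(data, schema)

# The parse "succeeds" with nothing consumed and one error per element.
# Keeping only the most recent errors would point at elem4 and elem5.
print(errors[-2:], leftover)
```

Only the elem2 error reflects the real mistake, and nothing in the recorded errors distinguishes it from the other four.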

I imagine this kind of thing would be pretty common for these left over data errors. Something fails early on that is the real error, but a bunch of optional/PoU things follow it and also fail, which leads to left over data. And showing one or more parse errors may not help the user know which one to focus on, especially since not all parse errors signify a problem.

I wonder if improvements to the VS Code debugger would help the most? With the issue of left over data, we do get an infoset. If we could visually overlay that over the actual data in the debugger it would probably make it very clear where things start going wrong and focus the user on the right part of the schema.



On 4/14/22 2:27 PM, Mike Beckerle wrote:

Please comment on this idea.

The problem is that users write a schema, get "left over data" when they
test it. The schema works.  The schema is, as far as DFDL and Daffodil are
concerned, correct. It just doesn't express what you intended it to
express. It IS a correct schema, just not for your intended format.


I think Daffodil needs to save the "last failure" purely for the case where
there is left-over data. Daffodil is happily ending the parse successfully
but reporting it did not consume all data.


In some applications where you are consuming messages from a network socket
which is a byte stream, this is 100% normal behavior (and no left-over-data
error would or should be issued.)


In tests and anything that is "file format" oriented, left-over data is a
real error. So the fact that Daffodil/DFDL says the parse ended normally
without error isn't helping.


In DFDL, a variable-occurrences array, the number-of-occurrences of which
is determined by the data itself, always is ended if a parse fails. So long
as maxOccurs has not been reached, the parse attempts another array
element, and if it fails, it *suppresses that error*, backs up to the end
of the prior array element (or start of array if there are no elements at
all), and *discards the failure information*, then goes on to parse "the
rest of the schema" meaning the stuff after the array.
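
A rough Python sketch of the array behavior described above. It is not Daffodil's implementation; `parse_array` and `parse_digit` are invented for illustration, with a one-digit element standing in for a real array element.

```python
# Toy illustration only; names and element format are hypothetical.
def parse_digit(data: bytes, pos: int):
    """Parse one ASCII digit; raise ValueError on failure."""
    if data[pos:pos + 1].isdigit():
        return int(data[pos:pos + 1]), pos + 1
    raise ValueError(f"expected digit at offset {pos}")


def parse_array(data: bytes, pos: int, parse_element, max_occurs: int):
    """Parse elements until one fails or max_occurs is reached."""
    items = []
    while len(items) < max_occurs:
        try:
            value, new_pos = parse_element(data, pos)
        except ValueError:
            # The failure is suppressed and its details are dropped here;
            # pos still points at the end of the prior element.
            break
        items.append(value)
        pos = new_pos
    return items, pos


items, end = parse_array(b"12x4", 0, parse_digit, max_occurs=10)
print(items, end)  # the array ends quietly at the first non-digit
```

If nothing follows the array in the schema, the bytes from `end` onward become left-over data and the reason the array stopped is gone.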


But what if nothing is after the array?


The "suppress the error" and "discard the failure" above... those are a
problem, because if the parse ends with left-over data, those are the "last
error before the parse ended", and those *may* be relevant to why all the
data was not consumed.


I think we need to preserve the failure information a bit longer than we
are.


So with that problem in mind here's a possible mechanism to provide better
diagnostics.


Maybe instead of deleting it outright we put it on a queue of depth N
(shallow, like 1 or 2), and as we put more failure info on that queue the
failure info it pushes out the other end is discarded, but at end of
processing you can look back in the parser state and see what the last N
failures were, and hopefully you find there the reason for the last array
ending early?


N could be set quite deep for debugging/schema-development, so you can look
back through it and see the backtracking decisions in reverse chronological
order as far as you need.
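
One possible shape for such a queue, sketched in Python using `collections.deque` with `maxlen`. The class name `FailureHistory` and the message strings are hypothetical; in Daffodil this would live in the parser state and hold real diagnostic objects.

```python
from collections import deque


class FailureHistory:
    """Keep only the last `depth` suppressed failures (hypothetical helper)."""

    def __init__(self, depth: int = 2):
        # A deque with maxlen silently drops the oldest entry when full.
        self._recent = deque(maxlen=depth)

    def record(self, message: str) -> None:
        self._recent.append(message)

    def most_recent_first(self) -> list[str]:
        return list(reversed(self._recent))


history = FailureHistory(depth=2)
for msg in ["elem at 0 failed", "elem at 3 failed", "elem at 7 failed"]:
    history.record(msg)

# With depth 2, the oldest failure has been pushed out.
print(history.most_recent_first())
```

A larger `depth` during schema development would give the longer reverse-chronological view of backtracking decisions.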


Comments? Variants? Alternatives?
