In some very unorthodox use of DFDL & Daffodil, I needed to ensure that I could
get XML output even from files that contained extra data after the last piece
of parsable data.
I accomplished this by adding a "DataBlob" element that consumes any otherwise
unparsable data; the lookahead in the pattern makes it stop just before the
next 0xFF marker byte:

    <xs:element name="DataBlob"  type="xs:hexBinary" dfdl:lengthKind="pattern" 
dfdl:lengthPattern="[\x00-\xFF]*?(?=\xFF+[\x01-\xFE])" 
dfdl:encoding="ISO-8859-1">
        <xs:annotation>
            <xs:appinfo source="http://www.ogf.org/dfdl/";>
                <dfdl:discriminator test="{ dfdl:valueLength(., 'bytes') gt 0 
}" />
            </xs:appinfo>
        </xs:annotation>
    </xs:element>
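
Since dfdl:lengthKind="pattern" applies the regex to the data as decoded with
the specified encoding (ISO-8859-1 maps each byte one-to-one onto a character,
which is why it is used here), the pattern's behavior can be checked outside of
Daffodil. A minimal Scala sketch, with made-up sample bytes:

    import java.util.regex.Pattern

    object LengthPatternDemo extends App {
      // Same pattern as the DataBlob element: lazily consume bytes up to
      // (but not including) the next 0xFF-introduced JPEG marker.
      val p = Pattern.compile("[\\x00-\\xFF]*?(?=\\xFF+[\\x01-\\xFE])")

      // Three junk bytes followed by an SOI marker (FF D8); illustrative data.
      val data = Array[Byte](0x01, 0x02, 0x03, 0xFF.toByte, 0xD8.toByte)

      // ISO-8859-1 maps bytes 0x00-0xFF onto chars U+0000-U+00FF losslessly.
      val text = new String(data, "ISO-8859-1")

      val m = p.matcher(text)
      if (m.lookingAt())   // anchored at the start, like Daffodil's scan
        println(s"DataBlob would consume ${m.end()} bytes")   // prints 3
      else
        println("no match: the DataBlob branch fails and the choice backtracks")
    }

Note that the discriminator rejects a zero-length match, so when the data
already starts at a valid marker this branch fails and the choice falls
through to the Markers branch.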

Applying this to a modified copy of the sample JPEG schema, I added this
element inside the xs:choice within the "Segment" element:

    <xs:choice>
        <xs:element ref="DataBlob" />
        <xs:group ref="Markers" />
        <xs:element ref="DataBlob" />
    </xs:choice>

I also made use of similar logic to ensure that a parse would run to
completion even if the stated length of an element was larger than the amount
of data actually remaining in the file:

    <xs:element name="packet_truncated">
        <xs:complexType>
            <xs:sequence>
                <xs:element name="Datablob" 
                            type="xs:hexBinary" 
                            dfdl:lengthKind="pattern" 
                            dfdl:lengthPattern="[\x00-\xFF]+$" 
                            dfdl:encoding="ISO-8859-1"
                            dfdl:outputValueCalc="{xs:hexBinary('00')}" >
                    <xs:annotation>
                        <xs:appinfo source="http://www.ogf.org/dfdl/";>
                            <dfdl:discriminator test="{ dfdl:valueLength(., 
'bytes') gt 0 }" />
                        </xs:appinfo>
                    </xs:annotation>
                </xs:element>
            </xs:sequence>
        </xs:complexType>
    </xs:element>
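
Unlike the lookahead pattern above, the $-anchored pattern matches
unconditionally to the end of the data, so this branch succeeds whenever at
least one byte remains (and, presumably, the dfdl:outputValueCalc supplies a
fixed placeholder byte so the element can still be unparsed). The same kind of
sketch, again with made-up bytes:

    import java.util.regex.Pattern

    object TruncatedPacketDemo extends App {
      // Pattern from packet_truncated: greedily consume every remaining byte.
      val p = Pattern.compile("[\\x00-\\xFF]+$")

      // Whatever is left in the stream, marker bytes included.
      val leftover = new String(Array[Byte](0x00, 0xFF.toByte, 0x7F), "ISO-8859-1")

      val m = p.matcher(leftover)
      if (m.lookingAt())
        println(s"Datablob consumes all ${m.end()} remaining bytes")   // prints 3
    }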

This required layering an xs:choice into each of the elements. As an example,
here is the modified SOF element (the added lines are marked with an
asterisk):

          <xs:complexType name="SOF">
 *          <xs:choice>
                <xs:sequence>
                    <xs:element name="Length" type="unsignedint16" 
dfdl:outputValueCalc="{ 6 + (3 * 
../Number_of_Source_Image_Components_in_the_Frame) + 2}"/>
                    <xs:element name="Precision" type="unsignedint8"/>
                    <xs:element name="Number_of_Lines_in_Source_Image" 
type="unsignedint16"/>
                    <xs:element name="Number_of_Samples_per_Line" 
type="unsignedint16"/>
                    <xs:element 
name="Number_of_Source_Image_Components_in_the_Frame" type="unsignedint8" 
dfdl:outputValueCalc="{ fn:count(../Image_Components_in_Frame/Image_Component) 
}"/>
                    <xs:element name="Image_Components_in_Frame" 
dfdl:lengthKind="explicit" dfdl:lengthUnits="bytes" dfdl:length="{3 * 
../Number_of_Source_Image_Components_in_the_Frame}">
                        <xs:complexType>
                            <xs:sequence>
                                <xs:element name="Image_Component" 
maxOccurs="unbounded" dfdl:occursCountKind="implicit">
                                    <xs:complexType>
                                        <xs:sequence>
                                            <xs:element 
name="Component_Identifier" type="unsignedint8"/>
                                            <xs:element 
name="Horizontal_Sampling_Factor" type="unsignedint4"/>
                                            <xs:element 
name="Vertical_Sampling_Factor" type="unsignedint4"/>
                                            <xs:element 
name="Quantization_Table_Selector" type="unsignedint8"/>
                                        </xs:sequence>
                                    </xs:complexType>
                                </xs:element>
                            </xs:sequence>
                        </xs:complexType>
                    </xs:element>
                </xs:sequence>
  *            <xs:element ref="packet_truncated" />
  *        </xs:choice>
          </xs:complexType>

I have been able to use this technique to create a schema that lets Daffodil
cleanly exit and produce an output XML file for virtually any JPEG file, no
matter how badly corrupted it is.
However, if Daffodil were modified to flag an error but still produce the
parsed portion of the file, it would allow the schema to remain simpler and
easier to read.
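
For anyone who wants to reproduce this, here is roughly what such a parse
looks like from the Scala API; a sketch assuming the Daffodil 3.x sapi, with
placeholder file names:

    import java.io.{ ByteArrayOutputStream, File, FileInputStream }
    import org.apache.daffodil.sapi.Daffodil
    import org.apache.daffodil.sapi.infoset.XMLTextInfosetOutputter
    import org.apache.daffodil.sapi.io.InputSourceDataInputStream

    object ParseCorruptJpeg extends App {
      // Compile the modified JPEG schema (file names are placeholders).
      val pf = Daffodil.compiler().compileFile(new File("jpeg.dfdl.xsd"))
      pf.getDiagnostics.foreach(d => println(d.getMessage()))
      if (pf.isError()) sys.exit(1)
      val dp = pf.onPath("/")

      val input = new InputSourceDataInputStream(new FileInputStream("corrupted.jpg"))
      val infoset = new ByteArrayOutputStream()
      val result = dp.parse(input, new XMLTextInfosetOutputter(infoset, true))

      // With the DataBlob/packet_truncated fallbacks in place, the parse
      // should complete and the infoset should include the junk bytes.
      result.getDiagnostics.foreach(d => println(d.getMessage()))
      println(infoset.toString("UTF-8"))
    }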


-----Original Message-----
From: Mike Beckerle <mbecke...@apache.org> 
Sent: Thursday, April 14, 2022 2:27 PM
To: dev@daffodil.apache.org
Subject: idea for helping with "left over data error"

Please comment on this idea.

The problem is that users write a schema and get "left over data" when they 
test it. The schema works. The schema is, as far as DFDL and Daffodil are 
concerned, correct. It just doesn't express what you intended it to express. 
It IS a correct schema, just not for your intended format.


I think Daffodil needs to save the "last failure", purely for the case where 
there is left-over data. Right now Daffodil happily ends the parse 
successfully but reports that it did not consume all the data.


In some applications where you are consuming messages from a network socket 
which is a byte stream, this is 100% normal behavior (and no left-over-data 
error would or should be issued.)
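
For instance, the streaming case can be handled today by calling parse
repeatedly on the same stream; a sketch reusing dp, input, and outputter from
the earlier example, and assuming the same 3.x API:

    // Consume back-to-back messages from one stream; left-over data after
    // each parse is simply the start of the next message.
    while (input.hasData()) {
      val result = dp.parse(input, outputter)
      if (result.isError())
        result.getDiagnostics.foreach(d => println(d.getMessage()))
    }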


In tests and anything that is "file format" oriented, left-over data is a real 
error. So the fact that Daffodil/DFDL says the parse ended normally without 
error isn't helping.


In DFDL, a variable-occurrences array, where the number of occurrences is 
determined by the data itself, is always ended when a parse fails. So long as 
maxOccurs has not been reached, the parser attempts another array element; if 
that attempt fails, it *suppresses that error*, backs up to the end of the 
prior array element (or to the start of the array if there are no elements at 
all), *discards the failure information*, and then goes on to parse "the rest 
of the schema", meaning the stuff after the array.
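
In rough pseudo-Scala, that behavior looks like the following. This is an
illustrative sketch with invented names, not the actual Daffodil internals:

    sealed trait Result
    case object Success extends Result
    final case class Failure(message: String, bitPos: Long) extends Result

    final class ParserState(var bitPos: Long)

    object ArrayParseSketch {
      // Parse speculative occurrences of an array element until one fails.
      def parseArray(state: ParserState,
                     parseOne: ParserState => Result,
                     maxOccurs: Int): Unit = {
        var occurs = 0
        while (occurs < maxOccurs) {
          val mark = state.bitPos          // end of prior element, or array start
          parseOne(state) match {
            case Success => occurs += 1
            case f: Failure =>
              state.bitPos = mark          // back up: undo the speculative parse
              // `f` is discarded right here, yet if nothing follows the
              // array it may be exactly why the data was never consumed.
              return
          }
        }
      }
    }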


But what if nothing is after the array?


The "suppress the error" and "discard the failure" above,.... those are a 
problem, because if the parse ends with left-over data, those are the "last 
error before the parse ended", and those *may* be relevant to why all the data 
was not consumed.


I think we need to preserve the failure information a bit longer than we are.


So, with that problem in mind, here is a possible mechanism to provide better 
diagnostics.


Maybe instead of deleting it outright, we put it on a queue of depth N 
(shallow, like 1 or 2). As more failure info goes onto that queue, the failure 
info pushed out the other end is discarded, but at the end of processing you 
can look back in the parser state, see what the last N failures were, and 
hopefully find there the reason for the last array ending early.
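
A minimal sketch of that bounded history, reusing the invented Failure type
from the sketch above (again, not actual Daffodil code); in the array loop,
the silent discard would become a call to record:

    import scala.collection.mutable.Queue

    // Hypothetical depth-N history of suppressed failures.
    final class FailureHistory(depth: Int) {
      private val q = Queue.empty[Failure]

      def record(f: Failure): Unit = {
        q.enqueue(f)
        if (q.size > depth) q.dequeue()   // oldest failure is pushed out and lost
      }

      // At end of parse, inspect the last N suppressed failures, most recent
      // first, to see why the final array ended early.
      def lastFailures: Seq[Failure] = q.toSeq.reverse
    }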


N could be set quite deep for debugging/schema-development, so you can look 
back through it and see the backtracking decisions in reverse chronological 
order as far as you need.


Comments? Variants? Alternatives?
