Best Practice - have it your way: capture unrecognized data or error on it

Beckerle, Mike Tue, 24 Sep 2019 07:53:27 -0700

I decided to write this up for the user list, for posterity.

Often in a DFDL schema there are error cases where the data is not handled by 
the schema, but the length of the data (a message, perhaps) can still be 
determined.

In that case, where you can still figure out the length, it is common for users 
to want to tolerate erroneous data by capturing the unrecognized data, rather 
than failing to parse it.

The technique starts with the top-level primary choice of the schema. This 
choice usually selects among the recognized message types by way of 
dfdl:choiceDispatchKey, dispatching to alternatives or "branches" that each 
carry a dfdl:choiceBranchKey.

To add a "default branch" that is selected if none of the others are, nest this 
primary choice inside another choice. This encapsulating choice has two 
branches. The first branch is the existing primary choice, unchanged. The 
second branch is the "unrecognized" case, and constructs an element to capture 
that data. When the dispatch key matches none of the branch keys, the primary 
choice fails with a processing error, and the parser backtracks to the second 
branch.

So putting that all together:

<xs:choice>
    <xs:choice dfdl:choiceDispatchKey="{ .... }">
         ... branches for recognized messages....
    </xs:choice>
    <xs:element name="unrecognized" type="xs:hexBinary"
                 dfdl:lengthKind="explicit"
                 dfdl:length="{ ....determine the length ... }"/>
</xs:choice>

I've been advocating that people use a DFDL variable to control whether their 
schema causes an error on unrecognized data, or captures it in the style shown 
above. That way one schema can be used both ways.

To let the variable control error versus capture of unrecognized messages, add 
an assert to the element above. The assert fails if the variable is set to 
cause errors, and passes if it is set to capture the data as elements. Here's 
the whole thing:

<xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl">
   <dfdl:defineVariable name="captureUnrecognizedMessages" type="xs:boolean"
                        defaultValue="true" external="true"/>
</xs:appinfo></xs:annotation>

<xs:choice>
    <xs:choice dfdl:choiceDispatchKey="{ .... }">
         ... branches for recognized messages....
    </xs:choice>
    <xs:sequence> <!-- handle unrecognized message -->
       <xs:sequence>
          <!--
              The discriminator is true, so we lock in that this branch *is*
              going to be selected. Subsequently, if the assert below fails,
              that specific diagnostic message will be issued; this choice
              will not backtrack and issue some nondescript
              "all choices failed" message.
            -->
          <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl">
              <dfdl:discriminator>{ fn:true() }</dfdl:discriminator>
          </xs:appinfo></xs:annotation>
       </xs:sequence>
       <!--
             Element to capture unrecognized data. Captures, or the assert
             fails with a diagnostic message.
          -->
       <xs:element name="unrecognized" type="xs:hexBinary"
                 dfdl:lengthKind="explicit"
                 dfdl:length="{ ....determine the length ... }">
          <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl">
              <!--
                   This assert passes if we're capturing unrecognized
                   messages; otherwise it fails and issues the diagnostic
                   message.

                   Note that the message can be an expression, which could
                   include the identifying message ID. Just be certain that
                   the message expression always evaluates successfully.
                 -->
              <dfdl:assert message="unrecognized message type">{
                     $tns:captureUnrecognizedMessages
               }</dfdl:assert>
          </xs:appinfo></xs:annotation>
       </xs:element>
    </xs:sequence>
</xs:choice>

Keep in mind that if you define a DFDL schema to "recognize" unhandled messages 
and parse them into hexBinary elements of undifferentiated bytes, then as far 
as the DFDL schema is concerned that data is valid. So you need some separate 
capability in the system to flag when these unhandled elements are being 
created, so they don't pass through your entire system as if they were valid.
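
As a minimal sketch of what such a separate capability might look like, assuming 
the parse result is available as an XML infoset and that the capture element 
keeps the name "unrecognized" used in the schema above, a downstream check in 
Python could scan for those elements. The sample infoset, its namespace, and 
the names messages, heartbeat, and seq are hypothetical, invented only for 
illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on a tag."""
    return tag.rsplit("}", 1)[-1]


def find_unrecognized(infoset_xml, element_name="unrecognized"):
    """Return the hex payload of every captured-but-unhandled element."""
    root = ET.fromstring(infoset_xml)
    return [
        (elem.text or "").strip()
        for elem in root.iter()
        if local_name(elem.tag) == element_name
    ]


# Hypothetical infoset: 'messages', 'heartbeat', and 'seq' are invented names.
SAMPLE_INFOSET = """
<ex:messages xmlns:ex="http://example.com/hypothetical">
  <ex:heartbeat><ex:seq>1</ex:seq></ex:heartbeat>
  <ex:unrecognized>DEADBEEF</ex:unrecognized>
  <ex:heartbeat><ex:seq>2</ex:seq></ex:heartbeat>
</ex:messages>
"""

if __name__ == "__main__":
    for payload in find_unrecognized(SAMPLE_INFOSET):
        # Two hex digits encode one byte of captured data.
        print(f"unrecognized message captured: {len(payload) // 2} bytes")
```

A check like this could run right after the parse, so that any infoset 
containing captured data is logged or routed aside before later stages treat it 
as a normal message.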
