Thank you Mike - outstanding information!
Recall my data format:
Label: Message
I have 3 perspectives on that format:
Perspective #1: There is a sequence of two strings, separated by a colon.
Perspective #2: There is a label terminated by a colon, followed by a message.
Perspective #3: There is a label, a colon, and a message.
Below I show the element declarations for the 3 perspectives. I am wondering if
I have inlined the correct amount of DFDL stuff, per Best Practice?
Perspective #1: There is a sequence of two strings, separated by a colon.
<xs:element name="really-simple-format">
<xs:complexType>
<xs:sequence dfdl:separator=":" dfdl:separatorPosition="infix">
<xs:element name="label" type="xs:string" />
<xs:element name="message" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
Perspective #2: There is a label terminated by a colon, followed by a message.
<xs:element name="really-simple-format">
<xs:complexType>
<xs:sequence>
<xs:element name="label" type="xs:string" dfdl:terminator=":" />
<xs:element name="message" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
Perspective #3: There is a label, a colon, and a message.
<xs:element name="really-simple-format">
<xs:complexType>
<xs:sequence>
<xs:element name="label" type="xs:string"
dfdl:lengthUnits="characters"
dfdl:lengthKind="pattern"
dfdl:lengthPattern="[\x0D-\xFF]+?(?=[:])" />
<xs:element name="colon" type="xs:string"
dfdl:lengthUnits="characters"
dfdl:lengthKind="explicit"
dfdl:length="1" />
<xs:element name="message" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
For schemas 1 and 2, should I inline more DFDL stuff? For schema 3, do I have
too much DFDL stuff?
/Roger
From: Mike Beckerle <[email protected]>
Sent: Wednesday, November 14, 2018 4:03 PM
To: [email protected]
Subject: Re: Best practice: all inline? no inline? mix?
Yes, of course the mix is best.
So why... well not all properties work the same way.
Delimiters are almost always going to be locally expressed on a model-group or
element.
The escape schemes used for delimiters are almost always going to be defined
centrally, but where they are used, varies with the delimiter - e.g., a
comma-separated sequence might allow escaping a comma embedded in an element
value, but the CRLF at the end of a line might not be allowed to appear in data
at all, i.e., cannot be escaped. So the escape scheme is defined centrally, but
applied locally.
Policy properties like encodingErrorPolicy almost always want to be at top
level.
LengthKind is a property that is quite problematic. I kind of wish we had
complexType lengthKind distinguished from simpleType lengthKind becasue a very
common situation in binary data wants all complex types to have
dfdl:lengthKind="implicit", but simple types to have
dfdl:lengthKind="explicit". And only one of those can be expressed as the
default at top level. So for binary data often every single complex type
element has a dfdl:ref="..." that refers to a named format that has
dfdl:lengthKind="implicit". Maybe. That might not be necessary if the simple
types all have type-definitions and those all say lengthKind="explicit" and are
heavily reused.
Some formats, not all, are reasonably well behaved.
They have mostly similar properties used throughout. E.g., they use a single
text encoding. A single byte order and bit-order, etc.
However, there are plenty of cases where quite diverse data is simply
juxtaposed in a data format, so that characteristics of the data change wildly.
A very common idiom is "envelope and payload" where an envelope, or header,
format is used to augment a payload that is in a quite-different, and perhaps
harder to access, data format. The envelope or header is often byte-aligned,
byte-oriented, and well-behaved i.e., "easy" data, encapsulating a payload that
is bit-oriented, bit aligned, perhaps different bit-order or byte-order,
different numeric representations, etc. NACT headers before Link16 payloads is
a good example of this. MIL-STD-2045 headers before USMTF payloads is sort of
the opposite example. That header format is bit-packed binary, and is quite
complex/challenging, and USMTF is textual and relatively speaking, easier.
DFDL has some properties that you aren't even allowed to put in top-level
scope, because that would *never* make sense. E.g., dfdl:inputValueCalc.
A practice I consider valuable is that the top level <dfdl:format.../>
annotation of a DFDL schema file, should always consist of exactly and only a
reference to a named format.
E.g.,
<xs:annotation><xs:appinfo...>
<dfdl:defineFormat name="xyzFormat2">
<dfdl:format ... all the 'top-level' basic properties for this
format... />
</dfdl:defineFormat>
<dfdl:format ref="tns:xyzFormat2"/> <!-- use the format -->
</xs:appinfo></xs:annotation>
That way you can reuse the format in another file that extends the schema, you
can build variations of it easily, etc.
Another good practice is to put the basic format definition as above here, in
its own DFDL schema file that is imported by the DFDL schema files that
actually define types and elements and groups.
________________________________
From: Costello, Roger L. <[email protected]<mailto:[email protected]>>
Sent: Wednesday, November 14, 2018 2:33:43 PM
To: [email protected]<mailto:[email protected]>
Subject: Best practice: all inline? no inline? mix?
Hello DFDL Community,
I have a simple data format:
Label: Message
Here is a sample input:
Dear Sir: Thank you for your response.
I will take this perspective:
The data format consists of a series of
two strings (label and message),
separated by colon.
Here is the XML that I wish to produce:
<really-simple-format>
<label>Dear Sir</label>
<message>Thank you for your response.</message>
</really-simple-format>
I have identified 3 approaches to design the DFDL schema:
1. Inline the DFDL stuff with the XML schema stuff.
2. Don't inline any DFDL stuff; instead, put all the DFDL stuff at the top,
inside xs:annotation.
3. Inline some DFDL stuff, put some DFDL stuff at the top, inside xs:annotation.
Below I show the 3 approaches. Which is best practice? I know you will say the
third approach (mix approach) is best practice. Okay, then, which DFDL stuff
should be inlined and which should be put at the top, inside xs:annotation?
What is the rationale for how you divvy up the DFDL stuff between inline and at
the top? Do you agree with how I divvied up the DFDL stuff? Would you put more
stuff inline? If so, what other stuff would you put inline?
------------------------------------
All Inline Approach
------------------------------------
<xs:element name="really-simple-format"
dfdl:alignment="implicit"
dfdl:alignmentUnits="bytes"
dfdl:encoding="UTF-8"
dfdl:escapeSchemeRef=""
dfdl:ignoreCase="no"
dfdl:initiator=""
dfdl:leadingSkip="0"
dfdl:lengthKind="delimited"
dfdl:outputNewLine="%CR;%LF;"
dfdl:representation="text"
dfdl:terminator=""
dfdl:textPadKind="none"
dfdl:textTrimKind="none"
dfdl:trailingSkip="0"
dfdl:truncateSpecifiedLengthString="no"
>
<xs:complexType>
<xs:sequence
dfdl:alignment="implicit"
dfdl:alignmentUnits="bytes"
dfdl:encoding="UTF-8"
dfdl:ignoreCase="no"
dfdl:initiatedContent="no"
dfdl:initiator=""
dfdl:leadingSkip="0"
dfdl:lengthKind="delimited"
dfdl:outputNewLine="%CR;%LF;"
dfdl:separator=":"
dfdl:separatorPosition="infix"
dfdl:separatorSuppressionPolicy="never"
dfdl:sequenceKind="ordered"
dfdl:terminator=""
dfdl:trailingSkip="0"
>
<xs:element name="label" type="xs:string"
dfdl:alignment="implicit"
dfdl:alignmentUnits="bytes"
dfdl:encoding="UTF-8"
dfdl:escapeSchemeRef=""
dfdl:ignoreCase="no"
dfdl:initiator=""
dfdl:leadingSkip="0"
dfdl:lengthKind="delimited"
dfdl:outputNewLine="%CR;%LF;"
dfdl:representation="text"
dfdl:terminator=""
dfdl:textPadKind="none"
dfdl:textTrimKind="none"
dfdl:trailingSkip="0"
dfdl:truncateSpecifiedLengthString="no"
/>
<xs:element name="message" type="xs:string"
dfdl:alignment="implicit"
dfdl:alignmentUnits="bytes"
dfdl:encoding="UTF-8"
dfdl:escapeSchemeRef=""
dfdl:ignoreCase="no"
dfdl:initiator=""
dfdl:leadingSkip="0"
dfdl:lengthKind="delimited"
dfdl:outputNewLine="%CR;%LF;"
dfdl:representation="text"
dfdl:terminator=""
dfdl:textPadKind="none"
dfdl:textTrimKind="none"
dfdl:trailingSkip="0"
dfdl:truncateSpecifiedLengthString="no"
/>
</xs:sequence>
</xs:complexType>
</xs:element>
------------------------------------
No Inline Approach
------------------------------------
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:format
alignment="1"
alignmentUnits="bytes"
binaryFloatRep="ieee"
binaryNumberRep="binary"
bitOrder="mostSignificantBitFirst"
byteOrder="bigEndian"
calendarPatternKind="implicit"
documentFinalTerminatorCanBeMissing="yes"
emptyValueDelimiterPolicy="none"
encoding="ISO-8859-1"
encodingErrorPolicy="replace"
escapeSchemeRef=""
fillByte="f"
floating="no"
ignoreCase="no"
initiator=""
initiatedContent="no"
leadingSkip="0"
lengthKind="delimited"
lengthUnits="bits"
nilKind="literalValue"
nilValueDelimiterPolicy="none"
occursCountKind="implicit"
outputNewLine="%CR;%LF;"
representation="text"
separator=":"
separatorPosition="infix"
separatorSuppressionPolicy="never"
sequenceKind="ordered"
terminator=""
textBidi="no"
textNumberCheckPolicy="strict"
textNumberPattern="#,##0.###;-#,##0.###"
textNumberRep="standard"
textNumberRounding="explicit"
textNumberRoundingIncrement="0"
textNumberRoundingMode="roundUnnecessary"
textOutputMinLength="0"
textPadKind="none"
textStandardBase="10"
textStandardExponentRep="E"
textStandardInfinityRep="Inf"
textStandardNaNRep="NaN"
textStandardZeroRep="0"
textStandardGroupingSeparator=","
textTrimKind="none"
trailingSkip="0"
truncateSpecifiedLengthString="no"
utf16Width="fixed"
/>
</xs:appinfo>
</xs:annotation>
<xs:element name="really-simple-format">
<xs:complexType>
<xs:sequence>
<xs:element name="label" type="xs:string" />
<xs:element name="message" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>
------------------------------------
Mix Approach
------------------------------------
<xs:annotation>
<xs:appinfo source="http://www.ogf.org/dfdl/">
<dfdl:format
alignment="1"
alignmentUnits="bytes"
binaryFloatRep="ieee"
binaryNumberRep="binary"
bitOrder="mostSignificantBitFirst"
byteOrder="bigEndian"
calendarPatternKind="implicit"
documentFinalTerminatorCanBeMissing="yes"
emptyValueDelimiterPolicy="none"
encoding="ISO-8859-1"
encodingErrorPolicy="replace"
escapeSchemeRef=""
fillByte="f"
floating="no"
ignoreCase="no"
initiator=""
initiatedContent="no"
leadingSkip="0"
lengthKind="delimited"
lengthUnits="bits"
nilKind="literalValue"
nilValueDelimiterPolicy="none"
occursCountKind="implicit"
outputNewLine="%CR;%LF;"
representation="text"
separatorSuppressionPolicy="never"
sequenceKind="ordered"
terminator=""
textBidi="no"
textNumberCheckPolicy="strict"
textNumberPattern="#,##0.###;-#,##0.###"
textNumberRep="standard"
textNumberRounding="explicit"
textNumberRoundingIncrement="0"
textNumberRoundingMode="roundUnnecessary"
textOutputMinLength="0"
textPadKind="none"
textStandardBase="10"
textStandardExponentRep="E"
textStandardInfinityRep="Inf"
textStandardNaNRep="NaN"
textStandardZeroRep="0"
textStandardGroupingSeparator=","
textTrimKind="none"
trailingSkip="0"
truncateSpecifiedLengthString="no"
utf16Width="fixed"
/>
</xs:appinfo>
</xs:annotation>
<xs:element name="really-simple-format">
<xs:complexType>
<xs:sequence dfdl:separator=":" dfdl:separatorPosition="infix">
<xs:element name="label" type="xs:string" />
<xs:element name="message" type="xs:string" />
</xs:sequence>
</xs:complexType>
</xs:element>