Attached is a note about a feature for DFDL/Daffodil to support base64, folded lines, etc. Formats using these are in high demand in the network-security community.
This memo describes a proposed feature for expressing data stream pre/post processing operations. Most of the discussion here uses parsing as context; where unparsing is not clearly symmetric, unparsing will also be described. New DFDL schema annotations are shown in the "daf:" namespace to make clear which properties are standard DFDL and which are the new extensions.

The core concept is a cluster of new properties:

* streamEncoding (literal string or DFDL expression)
* streamLengthKind (can be 'explicit', 'delimited', 'pattern', 'endOfParent', or 'prefixed')
* streamLength - used for streamLengthKind 'explicit'
* streamLengthUnits ('bits' or 'bytes')
* streamLengthPattern - used for streamLengthKind 'pattern'
* streamTerminator (literal string or DFDL expression) - used for streamLengthKind 'delimited'; neither used nor allowed for other length kinds (TBD: asymmetric with terminator on a non-delimited element)
* streamEscapeSchemeRef - used for streamLengthKind 'delimited' to escape the streamTerminator when necessary

These properties are valid on the DFDL annotation elements dfdl:format, dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, and dfdl:group.

There is one additional non-format property:

* streamTransform

This property cannot appear on dfdl:format; it is allowed only on dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, and dfdl:group. Specifying the streamTransform property puts a stream transform into use. The streamTransform property is specifically not allowed on dfdl:format because it is not sensible to put a stream transformation into effect across a lexical scope: stream transforms apply to the dynamic scope of the Term they are associated with. (This might not work out. It may be of value to define streamTransform in a format, if that format is named and only referenced from the term that defines the dynamic scope where that stream transform is to be used.
If we allow streamTransform on dfdl:format annotations, there are certain situations where we would want SDE errors to be detected, such as when streamTransform is in lexical scope over a file.)

A data stream is conceptually a stream of bytes. It can be an input stream for parsing or an output stream for unparsing. Use of the term "stream" here is consistent with Java's use of stream, as in InputStream and OutputStream. These are sources and sinks of bytes; if one wants to decode characters from them, one must do so by specifying the encoding explicitly.

A stream transform is a layering that creates one stream of bytes from another. An underlying stream is encapsulated by a transformation to create an overlying stream. When parsing, reading from the overlying stream causes reading of data from the underlying stream; that data is then transformed and becomes the bytes of the overlying stream returned from the read. The stream properties apply to the underlying stream data and indicate how to identify its bounds/length and, if a stream transform is textual, what encoding is used to interpret the underlying bytes.

Some transformations are naturally binary, bytes to bytes. Data decompression/compression is the typical example here: when parsing, the overlying stream's bytes are the result of decompressing the underlying stream's bytes. If a transform requires text, then a stream encoding must be defined. For example, base64 is a transform that creates bytes from text. Hence, a stream encoding is needed to convert the underlying stream of bytes into text; the base64 decoding then occurs on that text, which produces the bytes of the overlying stream.

We think of some transforms as text-to-text. Line folding/unfolding is one such: lines of text that are too long are wrapped by inserting a line-ending and a space. As a DFDL stream transform, this line-folding transform requires an encoding. The underlying bytes are decoded into characters according to the encoding.
Those characters are divided into lines, and the line unfolding (for parsing) is done to create longer lines of data; the resulting data is then encoded from characters back into bytes using the same encoding. (There may be opportunities to shortcut these transformations if the overlying stream is the data stream for an element with scannable text representation using the same character set encoding.)

DFDL can describe a mixture of character set decoding/encoding and binary value parsing/unparsing against the same underlying data representation; hence, the underlying data stream concept is always one of bytes. (TBD: maybe it has to be bits? E.g., in MIL-STD-2045 headers, the VMF payload data can be compressed. I don't know that this payload data always begins on a byte boundary.)

Daffodil parsing begins with a default standard data input stream; unparsing begins with a default standard output stream.

When a DFDL schema wants to describe, say, base64 decoding, the DFDL annotations might look like this:

    <element name="foo" daf:streamTransform="base64">
      <complexType>
        <sequence>
          ....
        </sequence>
      </complexType>
    </element>

This annotation means: when parsing element foo, take whatever data stream is in effect, layer a base64 data stream on it, and use that until the end of element foo. The streamEncoding property would be taken from the lexically enclosing format.

In this example, when element foo is being parsed, the current data input stream is augmented by being encapsulated in a base64 transformer. This transformer takes the data stream, decodes it to characters using the streamEncoding, then processes the resulting text, converting base64 to binary data. The APIs for defining the base64 or other transformers enable one to do these transformations in a streaming manner, on demand as data is pulled from the resulting data stream of bytes.
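The layering just described can be sketched in Python. This is only an illustrative sketch (the function and variable names are hypothetical, not Daffodil's API), and it converts the whole underlying stream at once rather than decoding on demand:

```python
import base64
import io

def overlay_base64(underlying: io.BytesIO, stream_encoding: str) -> io.BytesIO:
    """Create the overlying stream: decode the underlying bytes to text
    using streamEncoding, base64-decode that text, and expose the
    resulting binary bytes as a new stream."""
    text = underlying.read().decode(stream_encoding)
    return io.BytesIO(base64.b64decode(text))

# The underlying stream holds base64 text as bytes in the streamEncoding.
payload = b"\x01\x02 some binary payload \xff"
underlying = io.BytesIO(base64.b64encode(payload))  # b64encode yields ASCII bytes
overlying = overlay_base64(underlying, "utf-8")
assert overlying.read() == payload  # reads of the overlying stream see decoded bytes
```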
Of course it is possible to just convert the entire data object, but we want to enable streaming behavior in case stream-encoded objects are large.

We have just seen how the daf:streamEncoding property is used by element foo as part of the data stream transformation. Let's consider how streamLength works. There are now two lengths to describe. One is the length of the data that is to be transformed. The second is the length of the parsed element taken from the result of the transformation. One may have a base64-encoded region with a streamLength of 1000 bytes; within that, once decoded, one will have only 750 or so bytes available. The element's data is limited by the 750-byte length of the decoded data. At the time parsing begins, neither of these numbers, 1000 nor 750, may be known.

    <dfdl:defineFormat name="fooStreamFormat">
      <dfdl:format streamEncoding="utf-16" streamLengthKind="explicit"/>
    </dfdl:defineFormat>

This data stream will decode utf-16 characters from the underlying data stream, then base64 decode that text to get a stream of bytes.

    <dfdl:defineFormat name="fooFormat">
      <dfdl:format ref="tns:fooStreamFormat" encoding="utf-8" byteOrder="bigEndian"/>
    </dfdl:defineFormat>

Then the element declarations:

    <element name="len" type="xs:int".../>

    <element name="foo" dfdl:ref="tns:fooFormat" type="tns:fooType"
             dfdl:initiator="foo:"
             daf:streamLength="{ ../len }"
             daf:streamTransform="base64"/>

Note how the property daf:streamLength is supplied where the expression is relevant, while the other properties controlling the stream processing are expressed reusably. In this example, the dfdl:initiator for foo will be decoded as utf-8 characters from the byte stream produced by the base64 transform. That base64 data was, in turn, decoded from the utf-16 decode of the underlying byte stream.

For the unparse direction, this len element needs a dfdl:outputValueCalc. The calculation needs the length of the base64-encoded data.
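The 1000-to-750 relationship above comes from base64's 4:3 expansion: every 3 bytes of binary data become 4 encoded characters. A quick check of the arithmetic:

```python
import base64

# base64 encodes each 3 binary bytes as 4 characters, so a
# 1000-character encoded region carries 1000 * 3/4 = 750 bytes.
decoded = bytes(750)                  # 750 arbitrary payload bytes
encoded = base64.b64encode(decoded)
assert len(encoded) == 1000           # no padding needed: 750 is a multiple of 3
assert len(base64.b64decode(encoded)) == 750
```

Note that this counts base64 characters; with a multi-byte streamEncoding such as the utf-16 of the example, the underlying stream's byte length is correspondingly larger than its character count, which is why the memo's figures are approximate.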
This would be expressed as:

    <element name="len" type="xs:int"
             dfdl:outputValueCalc="{ daf:streamLength(../foo, 'bytes') }"/>

This function daf:streamLength is much like dfdl:valueLength and dfdl:contentLength, except that it accesses the underlying data stream representation. The units are 'bits', 'bytes', or 'characters'. If 'characters' is specified, then the value returned is the number of characters in the data stream's encoding of the data. In the example above, this would be the number of utf-16 characters in the underlying stream before base64 decoding takes place. ('characters' may not be needed.) If the units are specified as 'bytes', then the length in bytes of the underlying data stream, prior to transformation, is provided. ('bits' may or may not be needed; if provided, perhaps we can get away with it just being 'bytes' * 8 and require lengths to be a multiple of a byte.)

Let's look at an example of two interacting data stream transforms.

    <xs:sequence daf:streamEncoding="utf-8" daf:streamTransform="foldedLines"
                 daf:streamLengthKind="delimited">
      ...
      ... presumably everything here is textual, and utf-8, because foldedLines
      ... only applies sensibly to text.
      ...
      <xs:sequence daf:streamEncoding="us-ascii" daf:streamTransform="base64"
                   daf:streamLengthKind="delimited"
                   daf:streamTerminator="{ ../marker }">
        ...
        ... everything here is parsed against the bytes obtained from base64
        ... decoding, which is itself decoding the output of the foldedLines
        ... transform above. Base64 requires only us-ascii, which is a subset
        ... of utf-8.
        ...
      </xs:sequence>
    </xs:sequence>

Summary

* Allows stacking transforms one on top of another. So you can have base64-encoded compressed data as the payload representation of a child element within a larger element.
* Allows specifying properties of the underlying data stream separately from the properties of the logical data.
* Scopes the transforms over a term (model-group or element).
* Prevents inadvertent lexical scoping of a streamTransform from a lexically enclosing top-level format annotation.

Implementation Notes:

Introduction of a stream transform basically appears in the Term grammar as a combinator that surrounds the contained Term contents.
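A minimal sketch of that combinator idea, in Python rather than Daffodil's actual internals (all names are hypothetical, the streams are converted whole rather than on demand, and the CRLF-plus-space folding rule is an assumption): each transform wraps the contained term's parser so it runs against the overlying stream, and wrapping composes naturally to stack transforms, as in the foldedLines-over-base64 example above.

```python
import base64
import io

def stream_transform(decode, term_parser):
    """Hypothetical combinator: build the overlying stream by applying
    `decode` to the underlying bytes, then run the contained term's
    parser against that overlying stream."""
    def parse(underlying: io.BytesIO):
        overlying = io.BytesIO(decode(underlying.read()))
        return term_parser(overlying)
    return parse

# Two transforms stacked: foldedLines outside, base64 inside.
unfold = lambda data: data.replace(b"\r\n ", b"")        # assumed folding rule
b64 = lambda data: base64.b64decode(data.decode("us-ascii"))
read_all = lambda stream: stream.read()                  # trivial innermost "term parser"

parser = stream_transform(unfold, stream_transform(b64, read_all))

payload = b"\x00\x07payload bytes"
text = base64.b64encode(payload).decode("ascii")
folded = (text[:8] + "\r\n " + text[8:]).encode("utf-8")  # fold at an arbitrary column
assert parser(io.BytesIO(folded)) == payload
```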