Attached is a note about a feature for DFDL/Daffodil to support base64, folded lines, etc. Formats using these are in high demand in the network-security community.
This memo describes a proposed feature for expressing data stream pre/post processing operations. Most of the discussion here uses parsing as context; where unparsing is not clearly symmetric, unparsing will also be described. New DFDL schema annotations are shown in the "daf:" namespace to make clear which properties are standard DFDL and which are the new extensions.

The core concept is a cluster of new properties:

* streamEncoding (literal string or DFDL expression)
* streamLengthKind (can be 'explicit', 'delimited', 'pattern', 'endOfParent', or 'prefixed')
* streamLength - used for streamLengthKind 'explicit'
* streamLengthUnits ('bits' or 'bytes')
* streamLengthPattern - used for streamLengthKind 'pattern'
* streamTerminator (literal string or DFDL expression) - used for streamLengthKind 'delimited'; neither used nor allowed for other length kinds (TBD: asymmetric with terminator on a non-delimited element)
* streamEscapeSchemeRef - used for streamLengthKind 'delimited' to escape the streamTerminator when necessary

These properties are valid on the DFDL annotation elements dfdl:format, dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, and dfdl:group.

There is one additional non-format property:

* streamTransform

This property cannot appear on dfdl:format; it is allowed only on dfdl:element, dfdl:simpleType, dfdl:sequence, dfdl:choice, and dfdl:group. Specifying the streamTransform property puts a stream transform into use. The streamTransform property is specifically not allowed on dfdl:format because it is not sensible to put a stream transformation into effect across a lexical scope: stream transforms apply to the dynamic scope of the Term they are associated with. (This might not work out. It may be of value to define streamTransform in a format, if that format is named and only referenced from the term that defines the dynamic scope where that stream transform is to be used.
If we allow streamTransform on dfdl:format annotations, there are certain situations where we would want SDE errors to be detected, such as when streamTransform is in lexical scope over a file.)

A data stream is conceptually a stream of bytes. It can be an input stream for parsing or an output stream for unparsing. Use of the term "stream" here is consistent with Java's use of stream, as in InputStream and OutputStream. These are sources and sinks of bytes; if one wants to decode characters from them, one must do so by specifying the encoding explicitly.

A stream transform is a layering that creates one stream of bytes from another. An underlying stream is encapsulated by a transformation to create an overlying stream. When parsing, reading from the overlying stream causes reading of data from the underlying stream; that data is then transformed and becomes the bytes of the overlying stream returned from the read. The stream properties apply to the underlying stream data and indicate how to identify its bounds/length and, if a stream transform is textual, what encoding is used to interpret the underlying bytes.

Some transformations are naturally binary, bytes to bytes. Data decompression/compression is the typical example here: when parsing, the overlying stream's bytes are the result of decompressing the underlying stream's bytes. If a transform requires text, then a stream encoding must be defined. For example, base64 is a transform that creates bytes from text. Hence, a stream encoding is needed to convert the underlying stream of bytes into text; the base64 decoding then occurs on that text, which produces the bytes of the overlying stream.

We think of some transforms as text-to-text. Line folding/unfolding is one such: lines of text that are too long are wrapped by inserting a line-ending and a space. As a DFDL stream transform, this line-folding transform requires an encoding. The underlying bytes are decoded into characters according to the encoding.
Those characters are divided into lines, and the line unfolding (for parsing) is done to create longer lines of data; the resulting data is then encoded from characters back into bytes using the same encoding. (There may be opportunities to shortcut these transformations if the overlying stream is the data stream for an element with scannable text representation using the same character set encoding.)

DFDL can describe a mixture of character set decoding/encoding and binary value parsing/unparsing against the same underlying data representation; hence, the underlying data stream concept is always one of bytes. (TBD: maybe it has to be bits? E.g., in MIL-STD-2045 headers, the VMF payload data can be compressed. I don't know that this payload data always begins on a byte boundary.)

Daffodil parsing begins with a default standard data input stream; unparsing begins with a default standard output stream.

When a DFDL schema wants to describe, say, base64 decoding, the DFDL annotations might look like this:

    <element name="foo" daf:streamTransform="base64">
      <complexType>
        <sequence>
          ....
        </sequence>
      </complexType>
    </element>

This annotation means: when parsing element foo, take whatever data stream is in effect, layer a base64 data stream on it, and use that until the end of element foo. The streamEncoding property would be taken from the lexically enclosing format.

In this example, when element foo is being parsed, the current data input stream is augmented by being encapsulated in a base64 transformer. This transformer takes the data stream, decodes it to characters using the streamEncoding, then processes the resulting text, converting base64 to binary data. The APIs for defining the base64 or other transformers enable one to do these transformations in a streaming manner, on demand as data is pulled from the resulting data stream of bytes.
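The layering just described can be sketched in Python. This is only an illustrative sketch (the function and variable names are hypothetical, not Daffodil's API), and it converts the whole underlying stream at once rather than decoding on demand:

```python
import base64
import io

def overlay_base64(underlying: io.BytesIO, stream_encoding: str) -> io.BytesIO:
    """Create the overlying stream: decode the underlying bytes to text
    using streamEncoding, base64-decode that text, and expose the
    resulting binary bytes as a new stream."""
    text = underlying.read().decode(stream_encoding)
    return io.BytesIO(base64.b64decode(text))

# The underlying stream holds base64 text as bytes in the streamEncoding.
payload = b"\x01\x02 some binary payload \xff"
underlying = io.BytesIO(base64.b64encode(payload))  # b64encode yields ASCII bytes
overlying = overlay_base64(underlying, "utf-8")
assert overlying.read() == payload  # reads of the overlying stream see decoded bytes
```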
Of course it is possible to just convert the entire data object, but we want to enable streaming behavior in case stream-encoded objects are large.

We have just seen how the daf:streamEncoding property is used by element foo as part of the data stream transformation. Let's consider how streamLength works. There are now two lengths to describe. One is the length of the data that is to be transformed. The second is the length of the parsed element taken from the result of the transformation. One may have a base64-encoded region with a streamLength of 1000 bytes; within that, once decoded, one will have only 750 or so bytes available. The element's data is limited by the 750-byte length of the decoded data. At the time parsing begins, neither of these numbers, 1000 nor 750, may be known.

    <dfdl:defineFormat name="fooStreamFormat">
      <dfdl:format streamEncoding="utf-16" streamLengthKind="explicit"/>
    </dfdl:defineFormat>

This data stream will decode utf-16 characters from the underlying data stream, then base64 decode that text to get a stream of bytes.

    <dfdl:defineFormat name="fooFormat">
      <dfdl:format ref="tns:fooStreamFormat" encoding="utf-8" byteOrder="bigEndian"/>
    </dfdl:defineFormat>

Then the element declarations:

    <element name="len" type="xs:int".../>

    <element name="foo" dfdl:ref="tns:fooFormat" type="tns:fooType"
             dfdl:initiator="foo:"
             daf:streamLength="{ ../len }"
             daf:streamTransform="base64"/>

Note how the property daf:streamLength is supplied where the expression is relevant, while the other properties controlling the stream processing are expressed reusably. In this example, the dfdl:initiator for foo will be decoded as utf-8 characters from the byte stream produced by the base64 transform. That base64 data was, in turn, decoded from the utf-16 decode of the underlying byte stream.

For the unparse direction, this len element needs a dfdl:outputValueCalc. The calculation needs the length of the base64-encoded data.
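The 1000-to-750 relationship above comes from base64's 4:3 expansion: every 3 bytes of binary data become 4 encoded characters. A quick check of the arithmetic:

```python
import base64

# base64 encodes each 3 binary bytes as 4 characters, so a
# 1000-character encoded region carries 1000 * 3/4 = 750 bytes.
decoded = bytes(750)                  # 750 arbitrary payload bytes
encoded = base64.b64encode(decoded)
assert len(encoded) == 1000           # no padding needed: 750 is a multiple of 3
assert len(base64.b64decode(encoded)) == 750
```

Note that this counts base64 characters; with a multi-byte streamEncoding such as the utf-16 of the example, the underlying stream's byte length is correspondingly larger than its character count, which is why the memo's figures are approximate.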
This would be expressed as:

    <element name="len" type="xs:int"
             dfdl:outputValueCalc="{ daf:streamLength(../foo, 'bytes') }"/>

This function daf:streamLength is much like dfdl:valueLength and dfdl:contentLength, except that it accesses the underlying data stream representation. The units are 'bits', 'bytes', or 'characters'. If 'characters' is specified, then the value returned is the number of characters in the data stream's encoding of the data. In the example above, this would be the number of utf-16 characters in the underlying stream before base64 decoding takes place. ('characters' may not be needed.) If the units are specified as 'bytes', then the length in bytes of the underlying data stream, prior to transformation, is provided. ('bits' may or may not be needed; if provided, perhaps we can get away with it just being 'bytes' * 8 and require lengths to be a multiple of a byte.)

Let's look at an example of two interacting data stream transforms.

    <xs:sequence daf:streamEncoding="utf-8" daf:streamTransform="foldedLines"
                 daf:streamLengthKind="delimited">
      ...
      ... presumably everything here is textual, and utf-8, because foldedLines
      ... only applies sensibly to text.
      ...
      <xs:sequence daf:streamEncoding="us-ascii" daf:streamTransform="base64"
                   daf:streamLengthKind="delimited"
                   daf:streamTerminator="{ ../marker }">
        ...
        ... everything here is parsed against the bytes obtained from base64
        ... decoding, which is itself decoding the output of the foldedLines
        ... transform above. Base64 requires only us-ascii, which is a subset
        ... of utf-8.
        ...
      </xs:sequence>
    </xs:sequence>

Summary

* Allows stacking transforms one on top of another. So you can have base64-encoded compressed data as the payload representation of a child element within a larger element.
* Allows specifying properties of the underlying data stream separately from the properties of the logical data.
* Scopes the transforms over a term (model-group or element).
* Prevents inadvertent lexical scoping of a streamTransform from a lexically enclosing top-level format annotation.

Implementation Notes:

Introduction of a stream transform basically appears in the Term grammar as a combinator that surrounds the contained Term contents.
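A minimal sketch of that combinator idea, in Python rather than Daffodil's actual internals (all names are hypothetical, the streams are converted whole rather than on demand, and the CRLF-plus-space folding rule is an assumption): each transform wraps the contained term's parser so it runs against the overlying stream, and wrapping composes naturally to stack transforms, as in the foldedLines-over-base64 example above.

```python
import base64
import io

def stream_transform(decode, term_parser):
    """Hypothetical combinator: build the overlying stream by applying
    `decode` to the underlying bytes, then run the contained term's
    parser against that overlying stream."""
    def parse(underlying: io.BytesIO):
        overlying = io.BytesIO(decode(underlying.read()))
        return term_parser(overlying)
    return parse

# Two transforms stacked: foldedLines outside, base64 inside.
unfold = lambda data: data.replace(b"\r\n ", b"")        # assumed folding rule
b64 = lambda data: base64.b64decode(data.decode("us-ascii"))
read_all = lambda stream: stream.read()                  # trivial innermost "term parser"

parser = stream_transform(unfold, stream_transform(b64, read_all))

payload = b"\x00\x07payload bytes"
text = base64.b64encode(payload).decode("ascii")
folded = (text[:8] + "\r\n " + text[8:]).encode("utf-8")  # fold at an arbitrary column
assert parser(io.BytesIO(folded)) == payload
```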