This email is to start a discussion of features that would let DFDL express more data formats, particularly those that apply some form of algorithmic encoding (as opposed to charset encoding) to part or all of the data.
IETF data formats make extensive use of base64 encoding of binary data for inclusion in textual data. In addition, the textual formats make use of line folding: a line longer than 72 characters is continued on the next line by beginning that line with a space (or tab? not sure).

There are many other schemes where part of a data representation has to be algorithmically decoded before DFDL parsing can process it. A good example comes from the MIL-STD-2045 message header format. This header has flags that indicate whether the message contents are compressed, and with what compression algorithm. Parsing needs to choose among several algorithms based on values computed from the data. Unparsing similarly must determine which compression algorithm to use to compress the message contents.

Our plan in implementing this feature in Daffodil would be to gain experience with it and, at such time as we're satisfied with it, propose the feature for inclusion in a future revision of the DFDL standard.

Perhaps there is a better name, but for this email we'll use the property dfdl:transferEncoding. This term comes from MIME, where data can be transported encoded in a content transfer encoding designed to protect binary data from corruption, etc.

What is proposed is:

dfdl:transferEncoding takes a whitespace-separated list of transfer encoding names. The empty string means no transfer encoding will be used. An expression can be used to evaluate to the whitespace-separated list, or to the empty string.

A transfer encoding name identifies a transfer encoding algorithm. This algorithm can be

* bytes to bytes - example compress
* bytes to text - TBD (needed?)
* text to bytes - example base64, AIS
* text to text - TBD (needed?)

The whitespace-separated list must consist of compatible transfer encoding algorithms, i.e., the output of each must be acceptable input to the next.
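To make the list syntax and the compatibility rule concrete, here is a minimal Python sketch, reading the list left to right. It is not Daffodil code: the KINDS table and both function names are hypothetical, and the table only records the bytes/text direction (as seen when parsing) of the example names used in this email.

```python
# Hypothetical registry: name -> (input kind, output kind) when parsing.
# The entries mirror the examples in this email; this is not a Daffodil API.
KINDS = {
    "base64": ("text", "bytes"),
    "compress": ("bytes", "bytes"),
    "zip": ("bytes", "bytes"),
}


def parse_transfer_encoding(value):
    """Split a dfdl:transferEncoding value into a list of names.

    An empty (or all-whitespace) string yields an empty list,
    meaning no transfer encoding is used.
    """
    return value.split()


def check_compatible(names):
    """Raise ValueError unless each algorithm's output kind matches
    the input kind of the algorithm that follows it."""
    for name in names:
        if name not in KINDS:
            raise ValueError("unknown transfer encoding: " + name)
    # Walk adjacent pairs: (names[0], names[1]), (names[1], names[2]), ...
    for first, second in zip(names, names[1:]):
        produced = KINDS[first][1]
        expected = KINDS[second][0]
        if produced != expected:
            raise ValueError(
                "%s produces %s but %s expects %s"
                % (first, produced, second, expected)
            )
    return names
```

With this table, "base64 zip" passes the check (text to bytes, then bytes to bytes), while "zip base64" is rejected because zip produces bytes and base64 expects text.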
The first named algorithm is applied first, so, assuming these identifiers are valid, dfdl:transferEncoding="base64 zip" would mean the data is text, and will be converted from text to bytes by the base64 decoder, and then from bytes to bytes by the unzip decoder. The inverse happens when unparsing.

When a DFDL element has a dfdl:transferEncoding, the length of that element is the length of the transfer-encoded representation of the data. For example, an element of complex type can have a prefixed length indicating it is 16457 bytes long. If its transfer encoding specifies zip compression, then these 16457 bytes would be unzipped and the result would be larger; it could expand to, say, 50873 bytes. The content of the complex type would then be parsed from these 50873 bytes.

The implementation of transfer encodings generally involves Daffodil's parser and unparser combinators. Consider parsing first. The combinator would take action before and after parsing the content of the element. In the before action, the Daffodil DataInputStream would be encapsulated by another implementation of DataInputStream; this encapsulating stream would implement the transfer encoding decoder algorithm, reading data from the underlying DataInputStream. Multiple transfer encodings would result in multiple such encapsulations layered one upon the other. After the content is parsed, the combinator's after action is to unencapsulate the DataInputStream, returning to the original DataInputStream, from which some data will have been consumed. The position of the original DataInputStream must then be exactly the position after the last bit of the transfer-encoded data.

Some formats will require nested elements such that an outer element having a transfer encoding specified can have a text dfdl:encoding property specifying the text charset used in the transfer-encoded representation.
The inner nested element can then have a different dfdl:encoding property, which is used to interpret the decoded data as text. For example, suppose you have a large text string in UTF-8. This can be compressed to get bytes, and those bytes base64 encoded into the US-ASCII charset. This would be expressed by something like

  <element name="outer" dfdl:encoding="us-ascii"
           dfdl:transferEncoding="base64 compress">
    <complexType>
      <sequence>
        <element name="inner" type="xs:string" dfdl:encoding="utf-8"
                 dfdl:lengthKind="delimited"/>
        ....

About extensibility

It was a goal for this set of transfer encodings to be readily extensible. This is because many formats have specific encodings particular to them. AIS has one, ASN.1 BER has one (so-called "object" encoding), and there are a wide variety of compression algorithms. However, it is probably best to build some of these transfer encoders/decoders first, and then consider what is necessary to specify one without access to Daffodil internal classes and data structures.

About MIME names for encodings

TBD: identifiers like base64 mean different things in different contexts. In the XML world it is just an algorithm for creating a single long string of characters (much like how hexBinary means a single long string of hex digits). But in IETF Internet Message Format, base64 means a particular syntax with lines of a specific length. An IMF base64-encoded binary has a block structure with human-tolerable line lengths (max 72) and a specific introduction and termination to indicate the start/end. Perhaps use QNames so that ietf:base64 or mime:base64 can provide the distinctions using normal namespace qualification.

TBD: parameters to transfer encoding algorithms. We may need some way to express these. Perhaps a URL-style thing like dfdl:transferEncoding='compress?method=bz2'

...mike beckerle
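P.S. To make the outer/inner example concrete, here is a minimal Python sketch of the two layers, entirely outside Daffodil. It assumes zlib stands in for "compress" (this email does not pin down which algorithm that name selects) and plain base64 with no line breaks; the function names are made up for illustration.

```python
import base64
import zlib


def unparse_layers(text):
    """Unparse direction: the inverse of the parse order for "base64 compress"."""
    raw = text.encode("utf-8")            # inner dfdl:encoding="utf-8"
    compressed = zlib.compress(raw)       # "compress" (zlib is an assumed stand-in)
    return base64.b64encode(compressed)   # "base64": output is US-ASCII bytes


def parse_layers(encoded):
    """Parse direction: base64 is named first, so it is decoded first,
    then the resulting bytes are decompressed and read as UTF-8 text."""
    compressed = base64.b64decode(encoded)
    raw = zlib.decompress(compressed)
    return raw.decode("utf-8")
```

Here len() of the value returned by unparse_layers plays the role of the outer element's length (the transfer-encoded length), while the inner element is parsed from the decompressed bytes, which generally have a different length.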